ASYMPTOTIC RESULTS FOR MAXIMUM LIKELIHOOD
ESTIMATORS IN JOINT ANALYSIS OF REPEATED
MEASUREMENTS AND SURVIVAL TIME

By Donglin Zeng and Jianwen Cai 

University of North Carolina at Chapel Hill 

Maximum likelihood estimation has been extensively used in the 
joint analysis of repeated measurements and survival time. However, 
there is a lack of theoretical justification of the asymptotic properties 
for the maximum likelihood estimators. This paper intends to fill this 
gap. Specifically, we prove the consistency of the maximum likelihood 
estimators and derive their asymptotic distributions. The maximum 
likelihood estimators are shown to be semiparametrically efficient. 

1. Introduction. Joint analysis of both repeated measurements and sur- 
vival time has received much attention in the last decade. The motivation 
of such analysis arises from many medical studies. For example, in an HIV 
study, the progression of CD4 cell counts in HIV patients and the time to pa- 
tients' death are of interest. Three different types of questions can be asked. 
One may wish to know the effect of a particular factor, such as age at the 
entry, on both the progression of CD4 cell counts and the risk of death. In- 
terest may also arise in studying the longitudinal pattern of CD4 cell counts 
over a time period; however, the longitudinal path can be truncated due to 
drop-out or death. In another analysis, one may focus on how the actual 
CD4 cell count predicts the risk of HIV-related disease, where the true CD4 
cell count is often missing or measured with error. 

To answer these or similar questions in other medical studies, joint models 
for repeated measurements and survival time have been proposed. In gen- 
eral, a mixed-effects model (Chapter 9 in [7]) with normal random effects is 
used to model repeated measurements, while a proportional intensity model 
[2] is used to model the hazard function of survival time. The covariates in 
the models can be subjects' baseline variables, study time or error-free CD4
cell counts, and so on. Random effects are used in both the mixed model 
and the proportional hazards model to account for the dependence between 
repeated measurements and survival time due to unobserved heterogeneity. 
In some of the literature, such a joint model is described as either a selection 
model or a pattern-mixture model, depending upon how they are derived. 
When the conditional distribution of survival time given repeated measure-
ments is modeled, the derived joint model is called a selection model; when
the conditional distribution of repeated measurement given survival time is 
modeled, the derived joint model is called a pattern-mixture model. Selec- 
tion models have been studied by many authors in different contexts, for 
example, Tsiatis, DeGruttola and Wulfsohn [23], Wulfsohn and Tsiatis [27] 
and Xu and Zeger [28, 29]. On the other hand, Wu and Carroll [26], Wu 
and Bailey [25] and Hogan and Laird [12] proposed pattern-mixture models. 
In some studies where repeated measurements are considered as an inter- 
nal covariate predicting risk of survival time, joint analysis is regarded as a 
missing covariate problem or measurement error problem in a proportional 
hazards model. For instance, Chen and Little [5] considered the missing co- 
variate problem in the proportional hazards model, although the covariates 
there were assumed to be time-independent. One referee drew our attention 
to the paper by Dupuy, Grama and Mesbah [8]. In their paper, the authors 
stated the asymptotic results for the proportional hazards model with time- 
dependent covariate, where the covariate was modeled using a parametric 
distribution instead of a mixed-effects model and the covariate was assumed 
to be measured without error. 

In most of the joint analysis literature, nonparametric maximum like- 
lihood estimation has been proposed (e.g., [23, 27]). Here, nonparametric 
maximum likelihood estimation means that the nuisance parameter, for ex- 
ample, the cumulative baseline hazard function in the proportional hazards 
model, can be a function with jumps at some discrete observations (for a 
complete review of the nonparametric maximum likelihood estimation in a 
proportional hazards model, refer to [15]). Computationally, the EM algo- 
rithm [6] has often been used to calculate the maximum likelihood estimates, 
where random effects are treated as missing. 

However, although the maximum likelihood estimates have been shown to 
perform well in numerical studies [13], theoretical justification of the asymp- 
totic properties of the maximum likelihood estimates has not been well es- 
tablished, except for Chen and Little [5], who thoroughly studied this issue 
in a proportional hazards model with missing time-independent covariates. 
Therefore, in this paper we aim to derive the asymptotic properties of the 
maximum likelihood estimators in the joint models. Specifically, we rigor- 
ously prove the consistency of the maximum likelihood estimators and derive 
their asymptotic distributions. Our theoretical results further confirm that 
nonparametric maximum likelihood estimation, which has been proposed
in the literature [12, 23, 27], provides efficient estimation. Additionally, we 
show that the profile likelihood function can be used to give a consistent 
estimator for the asymptotic variance of the regression coefficients. 

The structure of this paper is as follows. A general framework for modeling
both repeated measurements and survival time is given in Section 2. The 
EM algorithm for maximum likelihood estimation is briefly described after- 
ward. Section 3 gives our main results on the asymptotic properties of the 
maximum likelihood estimators. The proofs for the main theorems are given 
in Sections 4 and 5. Some technical details are provided in the Appendix. 

2. Maximum likelihood estimation in joint models. 

2.1. Models and assumptions. In the joint analysis of repeated measurements
and survival times, repeated measurements are considered as the realizations
of a certain marker process (usually an internal covariate, Section 6.3.2
in [16]) at finite time points. We use $Y(t)$ to denote the value of such a
marker process at time $t$ and we introduce a counting process $R(t)$
(cf. II.1 of [1]) which is right-continuous and only jumps at time $t$ where
a measurement is taken. Furthermore, we use $N_T(t)$ and $N_C(t)$ to denote
the counting processes generated by survival time $T$ and censoring time $C$,
respectively; that is, $N_T(t) = I(T \le t)$ and $N_C(t) = I(C \le t)$, where
$I(\cdot)$ is the indicator function. Both $Y(t)$ and $T$ are outcome
variables of interest.

One essential assumption in all of the joint models proposed in the literature
is that the association between the marker process and the survival time is
due to observed covariate processes such as baseline information or study
time, and so on, which are denoted by $\mathcal{X}(t)$, and unobserved
subject-specific effects, which are denoted by $a$. This statement is vague
as it stands, so we give more rigorous assumptions in the following. To do
that, we introduce more notation: we denote by $\mathcal{H}_X(t)$ the
longitudinal covariate history prior to time $t$ and by $\mathcal{H}_Y(t)$
the longitudinal response history prior to time $t$; that is,
$\mathcal{H}_X(t) = \{\mathcal{X}(s): s < t\}$ and
$\mathcal{H}_Y(t) = \{Y(s): s < t\}$. Additionally, we denote by $\tau$ the
end time of the study. Then we impose the following model assumptions:

(A.1) Random effect $a$ follows a multivariate normal distribution with mean
zero and covariance $\Sigma_a$.

(A.2) For any $t \in [0, \tau]$, the covariate process $\mathcal{X}(t)$ is
fully observed and, conditional on $a$, $\mathcal{H}_X(t)$,
$\mathcal{H}_Y(t)$ and $T \ge t$, the distribution of $\mathcal{X}(t)$
depends only on $\mathcal{H}_X(t)$. Moreover, with probability one,
$\mathcal{X}(t)$ is continuously differentiable in $[0, \tau]$ and
$\max_{t \in [0,\tau]} \|\mathcal{X}'(t)\| < \infty$, where $\|\cdot\|$
denotes the Euclidean norm in real space and $\mathcal{X}'(t)$ denotes the
derivative of $\mathcal{X}(t)$ with respect to $t$.

(A.3) Conditional on $a$, $\mathcal{H}_X(t)$, $\mathcal{H}_Y(t)$,
$\mathcal{X}(t)$ and $T \ge t$, the intensity of the counting process
$N_T(t)$ at time $t$ is equal to

\[
\lambda(t) \exp\{(\phi \circ \tilde{W}(t))^T a + W(t)^T \gamma\}, \tag{1}
\]

where $W(t)$ and $\tilde{W}(t)$ are sub-processes of $\mathcal{X}(t)$ and
$\phi$ is a constant vector of the same dimension as $\tilde{W}(t)$. For any
vectors $v_1$ and $v_2$ of the same dimension, $v_1 \circ v_2$ is the vector
obtained by the component-wise product of $v_1$ and $v_2$.

(A.4) Conditional on $a$, $\mathcal{H}_X(t)$, $\mathcal{H}_Y(t)$,
$\mathcal{X}(t)$ and $T \ge t$, the marker process $Y(t)$ satisfies

\[
Y(t) = X(t)^T \beta + \tilde{X}(t)^T a + \varepsilon(t), \tag{2}
\]

where $X(t)$ and $\tilde{X}(t)$ are sub-processes of $\mathcal{X}(t)$ and
$\varepsilon(t)$ is a white noise process with variance $\sigma_y^2$.

(A.5) Conditional on $a$, $\mathcal{H}_X(T)$, $\mathcal{H}_Y(T)$,
$\mathcal{X}(T)$ and $T$, the intensity of the counting process $N_C(t)$ at
time $t$ depends only on $\mathcal{H}_X(t)$ and $\mathcal{X}(t)$ for any
$t \le T$.

(A.6) Conditional on $a$, $\mathcal{H}_X(T)$, $\mathcal{H}_Y(T)$,
$\mathcal{X}(T)$, $T$ and $C$, the intensity of the counting process $R(t)$
depends only on $\mathcal{H}_X(t)$ and $\mathcal{X}(t)$ for any
$t \le T \wedge C$.

Remark 2.1. The structure models (1) and (2) cover most of the joint
models proposed in the literature. For example, we consider a clinical trial
for HIV patients with two treatment arms and covariates measured at the
entry, such as age and gender and so on, in addition to the recording of CD4
counts along the follow-up. If we are interested in studying the simultaneous
effects of the treatments on both CD4 count and survival time, after
adjusting for other covariates at the entry and subject-specific effects, we
can choose each of $W(t)$, $\tilde{W}(t)$, $X(t)$ and $\tilde{X}(t)$ to
include the treatment variable and/or covariates measured at the entry.
Furthermore, we can use $t$ as a covariate in models (1) and (2) to study
time-dependent effects. In another situation, if $Y(t)$ is considered as a
marker process subject to measurement error which predicts risk of death
(e.g., [23]), then one would use the error-free covariate for $Y(t)$, which
equals $X(t)^T\beta + \tilde{X}(t)^T a$ based on (2), as one predictor with a
coefficient $\mu$ in the proportional hazards model (1). Some re-arrangement
shows that the obtained hazards model is a special case of the expression
(1), where $W(t) = X(t)$, $\tilde{W}(t) = \tilde{X}(t)$, $\phi$ is the vector
with each component being $\mu$ and $\gamma = \mu\beta$.

Remark 2.2. In assumptions (A.5) and (A.6), we implicitly assume that
there exist some appropriate measures such that the intensity functions of
$N_C(t)$ and $R(t)$ exist. In the missingness context, (A.5) and (A.6) are
special cases of the coarsening at random assumption [10]. In fact, we can
allow (A.5) and (A.6) to be even more general; for example, the intensities
of $N_C(t)$ and $R(t)$ can depend on the observed history of repeated
measurements. Since this generality will not change our subsequent arguments,
we choose to work with the current simpler assumptions. Assumptions (A.5) and
(A.6) can be further simplified under some special situations: when
$\mathcal{X}(t)$ is equal to $g(X; t)$, where $g(\cdot)$ is a deterministic
function and $X$ are random variables at time zero, then assumptions (A.5)
and (A.6) can be replaced with the following assumptions: conditional on $X$,
$C$ and $R(t)$ are independent of $T$, $\mathcal{H}_Y(T)$ and $a$. Such
assumptions have been used in [23]. Another special situation when (A.6) is
satisfied is when repeated measurements are obtained on fixed schedules, as
seen in many cohort studies of clinical experiments.
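
As an illustration of the kind of data that models (1) and (2) describe, the following is a minimal simulation sketch, not the authors' code. It assumes a deliberately simplified special case: a scalar random intercept $a$, one binary baseline covariate, a constant baseline hazard, censoring only at the end of study, and a fixed visit schedule. The function name `simulate_subject` and all parameter values are hypothetical.

```python
import math
import random


def simulate_subject(rng, beta=(1.0, 0.5), gamma=0.3, phi=0.8,
                     sigma_a=1.0, sigma_y=0.5, base_hazard=0.1,
                     tau=5.0, visit_gap=1.0):
    """Draw one subject from a toy special case of models (1) and (2).

    All simplifications (scalar random intercept, constant baseline hazard,
    censoring at tau only, fixed visit schedule) are assumptions of this sketch.
    """
    x = rng.choice([0.0, 1.0])                 # baseline covariate
    a = rng.gauss(0.0, sigma_a)                # (A.1): a ~ N(0, sigma_a^2)
    # Model (1) with a constant baseline hazard and time-independent
    # covariates: T is exponential given (a, x).
    rate = base_hazard * math.exp(phi * a + gamma * x)
    t_event = rng.expovariate(rate)
    z = min(t_event, tau)                      # Z = min(T, C) with C = tau
    delta = 1 if t_event <= tau else 0         # Delta = I(T <= C)
    # Fixed visit schedule 0, gap, 2*gap, ... truncated at Z.
    n_visits = int(z // visit_gap) + 1
    times = [k * visit_gap for k in range(n_visits)]
    # Model (2): Y(t) = beta0 + beta1 * x + a + eps(t).
    y = [beta[0] + beta[1] * x + a + rng.gauss(0.0, sigma_y) for _ in times]
    return {"x": x, "z": z, "delta": delta, "times": times, "y": y}


rng = random.Random(0)
sample = [simulate_subject(rng) for _ in range(200)]
```

Because the visit schedule is fixed in advance, this toy design falls under the last special situation mentioned in Remark 2.2.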

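Under assumptions (A.1)-(A.6), the conditional distribution function for
the right-censored event time $(Z = T \wedge C, \Delta = I(T \le C))$, the
repeated measurement process
$\{(Y(t)I(R(t) - R(t-) > 0), R(t)): t \le Z\}$ and
$\{\mathcal{X}(t): t \le Z\}$, given random effects $a$, can be written as

\[
\begin{aligned}
& f(\mathcal{X}(0)) \prod_{t \le Z} \{f(\mathcal{X}(t) \mid \mathcal{H}_X(t))\}
  \prod_{t \le Z} \{f(Y(t) \mid a, \mathcal{H}_X(t), \mathcal{X}(t))\}^{\delta R(t)} \\
& \quad \times \prod_{t \le Z} \{1 - E[dN_T(t) \mid T \ge t, a, \mathcal{H}_X(t), \mathcal{X}(t)]\}^{1 - \delta N_T(t)}
  \{E[dN_T(t) \mid T \ge t, a, \mathcal{H}_X(t), \mathcal{X}(t)]\}^{\delta N_T(t)} \\
& \quad \times \prod_{t \le Z} \{1 - E[dN_C(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]\}^{1 - \delta N_C(t)}
  \{E[dN_C(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]\}^{\delta N_C(t)} \\
& \quad \times \prod_{t \le Z} \{1 - E[dR(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]\}^{1 - \delta R(t)}
  \{E[dR(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]\}^{\delta R(t)}.
\end{aligned}
\]

Here $\delta N_T(t) = N_T(t) - N_T(t-)$, $\delta N_C(t) = N_C(t) - N_C(t-)$
and $\delta R(t) = R(t) - R(t-)$.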

We further assume that the parameters in $f(\mathcal{X}(0))$,
$f(\mathcal{X}(t) \mid \mathcal{H}_X(t))$,
$E[dN_C(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]$ and
$E[dR(t) \mid \mathcal{H}_X(t), \mathcal{X}(t)]$ are distinct from the
parameters in models (1) and (2). The latter consist of
$\theta = (\sigma_y, \mathrm{Vec}(\Sigma_a)^T, \beta^T, \gamma^T, \phi^T)^T$
and $\Lambda(t) = \int_0^t \lambda(s)\,ds$, where $\mathrm{Vec}(\Sigma_a)$ is
the vector consisting of all the elements in the upper triangular part of
$\Sigma_a$. Then the observed likelihood function of $(\theta, \Lambda)$ from
$n$ i.i.d. observations is proportional to

\[
\begin{aligned}
\prod_{i=1}^n \int_a
& (2\pi\sigma_y^2)^{-N_i/2}
  \exp\{-(Y_i - X_i^T\beta - \tilde{X}_i^T a)^T (Y_i - X_i^T\beta - \tilde{X}_i^T a)/2\sigma_y^2\} \\
& \times \lambda(Z_i)^{\Delta_i}
  \exp\Bigl\{\Delta_i\{(\phi \circ \tilde{W}_i(Z_i))^T a + W_i(Z_i)^T\gamma\}
  - \int_0^{Z_i} e^{(\phi \circ \tilde{W}_i(t))^T a + W_i(t)^T\gamma}\,d\Lambda(t)\Bigr\} \\
& \times (2\pi)^{-d_a/2} |\Sigma_a|^{-1/2} \exp\{-a^T \Sigma_a^{-1} a/2\}\,da,
\end{aligned}
\tag{3}
\]

where for subject $i$, $Y_i$ denotes the vector of the observed repeated
measurements, $X_i$ denotes the matrix with each column equal to the observed
covariate $X_i(t)$ at the time of each measurement, $\tilde{X}_i$ denotes the
matrix with each column equal to the observed covariate $\tilde{X}_i(t)$ at
the time of each measurement, $N_i$ is the number of the observed repeated
measurements and $d_a$ is the dimension of $a$.
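
To make the structure of (3) concrete, here is a minimal numerical sketch of one subject's contribution, under assumptions that are not in the paper: a scalar random intercept ($d_a = 1$, $\tilde{X}_i(t) = \tilde{W}_i(t) = 1$), a time-independent linear predictor `eta` standing in for $W_i^T\gamma$, and composite Simpson integration over $a$. The function name and arguments are hypothetical.

```python
import math


def subject_likelihood(y, mu, delta, eta, cum_haz, haz, phi,
                       sigma_y, sigma_a, half_width=8.0, n_steps=2000):
    """One factor of (3) for a scalar random intercept a (sketch only).

    y       : observed measurements Y_i
    mu      : fixed-effect means X_i^T beta at the measurement times
    eta     : time-independent value of W_i^T gamma (assumption of this sketch)
    cum_haz : Lambda(Z_i); haz : lambda(Z_i)
    n_steps must be even for Simpson's rule.
    """
    n_obs = len(y)

    def integrand(a):
        rss = sum((yj - mj - a) ** 2 for yj, mj in zip(y, mu))
        long_part = ((2 * math.pi * sigma_y ** 2) ** (-n_obs / 2)
                     * math.exp(-rss / (2 * sigma_y ** 2)))
        lin = phi * a + eta
        surv_part = haz ** delta * math.exp(delta * lin - math.exp(lin) * cum_haz)
        re_part = (math.exp(-a * a / (2 * sigma_a ** 2))
                   / math.sqrt(2 * math.pi * sigma_a ** 2))
        return long_part * surv_part * re_part

    lo, hi = -half_width * sigma_a, half_width * sigma_a
    h = (hi - lo) / n_steps
    total = integrand(lo) + integrand(hi)
    for k in range(1, n_steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3
```

When `phi` is zero the random effect drops out of the survival part, so the integral factorizes into a multivariate normal density for $Y_i$ times the usual censored-data likelihood term, which gives a convenient sanity check.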

2.2. Maximum likelihood estimation. We can obtain the nonparametric
maximum likelihood estimates for $\theta$ and $\Lambda$ based on (3). To do
that, we let $\Lambda(\cdot)$ be an increasing and right-continuous step
function with jumps only at $Z_i$ for which $\Delta_i = 1$. Moreover, the
maximum likelihood estimates maximize a modified objective function of (3),
which is obtained by replacing $\lambda(Z_i)$ in (3) with $\Lambda\{Z_i\}$,
the jump size of $\Lambda(\cdot)$ at $Z_i$. We denote the logarithm of the
modified objective function as $l_n(\theta, \Lambda)$.

The expectation-maximization (EM) algorithm has been used to calculate
the maximum likelihood estimates. In the EM algorithm, the random effects
$a_i$, $i = 1, \ldots, n$, are treated as missing. Thus, the M-step is to
solve the conditional score equations of the complete data given the observed
data. Such a conditional expectation is evaluated using the
Gaussian-quadrature approximation in the E-step (Chapter 5 of [9]). Since
$\Lambda$ can be derived via one-step plug-in in the M-step, such EM
algorithms often converge rapidly. However, neither the efficiency nor the
convergence of the EM algorithm has been well justified for this context.
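
The E-step expectation can be sketched numerically as follows. This is an illustration under stated assumptions rather than the algorithm used in the cited work: the random effect is scalar, and composite Simpson integration on a truncated grid is swapped in for Gaussian quadrature to keep the example dependency-free. The names `posterior_mean` and `log_weight` are hypothetical.

```python
import math


def posterior_mean(g, log_weight, sigma_a, half_width=8.0, n_steps=2000):
    """Approximate E[g(a) | observed data] for a scalar a ~ N(0, sigma_a^2).

    log_weight(a) is the log of the complete-data likelihood factors that
    involve a, excluding the normal density of a itself. Simpson's rule on
    [-half_width * sigma_a, half_width * sigma_a] replaces Gaussian
    quadrature here (an assumption of this sketch); n_steps must be even.
    """
    lo = -half_width * sigma_a
    h = 2 * half_width * sigma_a / n_steps
    num = 0.0
    den = 0.0
    for k in range(n_steps + 1):
        a = lo + k * h
        simpson = 1 if k in (0, n_steps) else (4 if k % 2 else 2)
        w = simpson * math.exp(log_weight(a) - a * a / (2 * sigma_a ** 2))
        num += w * g(a)
        den += w
    # The common factor h / 3 cancels in the ratio.
    return num / den
```

With only the longitudinal part in `log_weight`, the posterior of $a$ is normal and its mean has a closed form, which can be used to check the approximation before the survival factor is added.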

We denote the maximum likelihood estimates by $(\hat\theta, \hat\Lambda)$.
The profile log-likelihood function for $\theta$ can be used to estimate the
asymptotic covariance of $\hat\theta$, as given in [18]. In detail, we define
the profile log-likelihood function of $\theta$ as
$pl_n(\theta) = \max_{\Lambda \in \mathcal{Z}_n} l_n(\theta, \Lambda)$, where
$\mathcal{Z}_n$ consists of all the right-continuous step functions only with
positive jumps at $Z_i$ for which $\Delta_i = 1$. Then the second-order
numerical differences of $pl_n(\theta)$ at $\theta = \hat\theta$ can be used
to approximate the asymptotic variance of $\hat\theta$. In particular, for
any constant sequence $h_n = O(n^{-1/2})$ and any constant vector $e$,

\[
-(n h_n^2)^{-1}\{pl_n(\hat\theta + h_n e) - 2\,pl_n(\hat\theta) + pl_n(\hat\theta - h_n e)\}
\]

approximates $e^T I e$. Here, $I$ is the efficient information matrix for
$\theta$ and it is also equal to the inverse of the asymptotic covariance of
$\sqrt{n}\,\hat\theta$. However, this result is not trivial and will be fully
stated and justified in the following sections.

Simulation studies conducted in the past literature [5, 13] have indicated 
good performance of the maximum likelihood estimates and the proposed 
variance estimation approach in small samples. 
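
The second-difference idea can be tried on a toy problem. The sketch below is not the joint model: it assumes i.i.d. normal data with mean $\theta$ (the parameter of interest) and a variance that plays the role of the nuisance parameter and is profiled out in closed form. The function names are hypothetical, and only a scalar $\theta$ is handled.

```python
import math
import random


def profile_loglik(theta, xs):
    """Profile log-likelihood of a normal mean, variance maximised out."""
    n = len(xs)
    s2 = sum((x - theta) ** 2 for x in xs) / n   # nuisance maximiser given theta
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1.0)


def curvature_information(pl, theta_hat, n, h=None):
    """Second-difference estimate of the information for a scalar theta.

    Uses the step h = n ** -0.5 unless another step is supplied.
    """
    h = n ** -0.5 if h is None else h
    second_diff = pl(theta_hat + h) - 2.0 * pl(theta_hat) + pl(theta_hat - h)
    return -second_diff / (n * h * h)


rng = random.Random(1)
xs = [rng.gauss(2.0, 1.5) for _ in range(400)]
theta_hat = sum(xs) / len(xs)
info = curvature_information(lambda t: profile_loglik(t, xs), theta_hat, len(xs))
variance_of_sqrt_n_theta = 1.0 / info
```

In this toy case the efficient information for the mean is the reciprocal of the variance, so `info` should be close to one over the sample variance.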

3. Main results. In this section we provide the asymptotic properties
of $(\hat\theta, \hat\Lambda)$. Theorem 3.1 concerns the consistency of the
estimators; Theorem 3.2 gives their asymptotic distribution; the use of the
profile log-likelihood function is justified by Theorem 3.3.

In addition to (A.1)-(A.6), some technical assumptions are needed for our 
main theorems. 

(A.7) Recall that $N$ is the number of observed repeated measurements.
There exists an integer $n_0$ such that $P(N \le n_0) = 1$. Moreover,
$P(N \ge d_a \mid \mathcal{H}_X(T), \mathcal{X}(T), T) > 0$ with probability
one.
(A.8) The maximal right-censoring time is equal to $\tau$.
(A.9) Both $P(X X^T \text{ is full rank})$ and
$P(\tilde{X} \tilde{X}^T \text{ is full rank})$ are positive. Additionally,
if there exist constant vectors $C_0$ and $\tilde{C}_0$ such that, with
positive probability, $W(t)^T C_0 = \mu(t)$ and
$\tilde{W}(t) \circ \tilde{C}_0 = 0$ for a deterministic function $\mu(t)$
for all $t \in [0, \tau]$, then $C_0 = 0$, $\tilde{C}_0 = 0$ and
$\mu(t) = 0$.
(A.10) The true parameter for $\theta$, denoted by
$\theta_0 = (\sigma_{0y}, \mathrm{Vec}(\Sigma_{0a})^T, \beta_0^T, \gamma_0^T, \phi_0^T)^T$,
satisfies $\|\theta_0\| < M_0$, $\sigma_{0y} > M_0^{-1}$ and
$\min_{\|e\|=1} e^T \Sigma_{0a} e > M_0^{-1}$ for a known positive constant
$M_0$.
(A.11) The true hazard rate function, denoted by $\lambda_0(t)$, is bounded
and positive in $[0, \tau]$.

Remark 3.1. Assumption (A.7) stipulates that some subjects have at
least $d_a$ repeated measurements. Assumption (A.8) is equivalent to saying
that any subject surviving after time $\tau$ is right-censored at $\tau$.
Since (A.2), (A.10) and (A.11) imply that, conditional on
$\mathcal{H}_X(\tau)$ and $\mathcal{X}(\tau)$, the probability of a subject
surviving after time $\tau$ is at least some positive constant $c_0$, we
conclude that
$P(C \ge \tau \mid \mathcal{H}_X(\tau), \mathcal{X}(\tau)) = P(C = \tau \mid \mathcal{H}_X(\tau), \mathcal{X}(\tau)) > c_0$
with probability one. The assumptions given in the first half of (A.9) are
the same as the identifiability assumption used in a linear mixed effects
model, while the second half of (A.9) is used to identify the regression
coefficients in the proportional hazards model (1). When
$\tilde{X}(t) = [1, X]$ and $X(t) = W(t) = \tilde{W}(t) = X$ for
time-independent variables $X$, assumption (A.9) is equivalent to the linear
independence of $[1, X]$ with positive probability. Finally, assumption
(A.10) indicates that $\theta_0$ belongs to a known compact set within the
domain of $\theta$, denoted by $\Theta$.

We obtain the following theorems. 
Theorem 3.1. Under assumptions (A.1)-(A.11), the maximum likelihood
estimator $(\hat\theta, \hat\Lambda)$ is strongly consistent under the
product metric of the Euclidean norm and the supremum norm on $[0, \tau]$;
that is,

\[
\|\hat\theta - \theta_0\| + \sup_{t \in [0,\tau]} |\hat\Lambda(t) - \Lambda_0(t)| \to 0 \quad \text{a.s.}
\]

Theorem 3.1 states the strong consistency of the maximum likelihood
estimator. Although it is assumed that $\theta$ is bounded, we impose no
compactness assumption on the estimate $\hat\Lambda$. In fact, obtaining the
boundedness of $\hat\Lambda$ is the key to the proof of Theorem 3.1. The
proof will be given in Section 4. Once the result of Theorem 3.1 holds,
Theorem 3.2 states the asymptotic properties of the maximum likelihood
estimator.

Theorem 3.2. Under assumptions (A.1)-(A.11),
$\sqrt{n}(\hat\theta - \theta_0, \hat\Lambda - \Lambda_0)$ weakly converges
to a Gaussian random element in $R^d \times l^\infty[0, \tau]$, where $d$ is
the dimension of $\theta$ and $l^\infty[0, \tau]$ is the metric space of all
bounded functions in $[0, \tau]$. Furthermore,
$\sqrt{n}(\hat\theta - \theta_0)$ weakly converges to a multivariate normal
distribution with mean zero and its asymptotic variance attains the
semiparametric efficiency bound for $\theta_0$.

The definition of semiparametric efficiency can be found in Section 3 of [3].
Theorem 3.2 declares that $\hat\theta$ is an efficient estimator for
$\theta_0$. The proof of Theorem 3.2 is based on the Taylor expansion of the
score equations for $\hat\theta$ and $\hat\Lambda$ around the true parameters
$\theta_0$ and $\Lambda_0$. The key to the proof of the theorem is to show
that the information operator for $(\theta_0, \Lambda_0)$ is continuously
invertible in an appropriate metric space. The proof of Theorem 3.2 will be
given in Section 5.

When both Theorems 3.1 and 3.2 are true, we can verify the smoothness
conditions in Theorem 1 of [18] and show that the profile log-likelihood
function, $pl_n(\theta)$, approximates a nondegenerate parabolic function
around $\hat\theta$. In particular, the inverse of the curvature of the
profile log-likelihood function at $\hat\theta$ can be used to estimate the
asymptotic variance of $\hat\theta$. In other words, the following result
holds.

Theorem 3.3. Under assumptions (A.1)-(A.11),
$2\{pl_n(\hat\theta) - pl_n(\theta_0)\}$ weakly converges to a chi-square
distribution with $d$ degrees of freedom and, moreover,

\[
-\frac{pl_n(\hat\theta + h_n e) - 2\,pl_n(\hat\theta) + pl_n(\hat\theta - h_n e)}{n h_n^2}
\;\xrightarrow{\;p\;}\; e^T I e,
\]

where $h_n = O_p(n^{-1/2})$, $e$ is any vector in $R^d$ with unit norm, and
$I$ is the efficient information matrix for $\theta_0$.
Remark 3.2. Specifically, to estimate the $(s, l)$-element of $I$, we let
$e_s$ and $e_l$ be canonical basis elements which have ones at the $s$th
coordinate and $l$th coordinate, respectively, and have zeros elsewhere. Then
the $(s, l)$-element of $I$ can be estimated by

\[
-(n h_n^2)^{-1}\{pl_n(\hat\theta + h_n e_s + h_n e_l)
- pl_n(\hat\theta + h_n e_s - h_n e_l) - pl_n(\hat\theta - h_n e_s + h_n e_l) + pl_n(\hat\theta)\}.
\]

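4. Proof of Theorem 3.1. In this section we prove the consistency result
for $(\hat\theta, \hat\Lambda)$. We recall that $l_n(\theta, \Lambda)$ is
equal to

\[
\begin{aligned}
\sum_{i=1}^n \log \int_a
& (2\pi\sigma_y^2)^{-N_i/2}
  \exp\{-(Y_i - X_i^T\beta - \tilde{X}_i^T a)^T (Y_i - X_i^T\beta - \tilde{X}_i^T a)/2\sigma_y^2\} \\
& \times \Lambda\{Z_i\}^{\Delta_i}
  \exp\Bigl\{\Delta_i((\phi \circ \tilde{W}_i(Z_i))^T a + W_i(Z_i)^T\gamma)
  - \int_0^{Z_i} e^{(\phi \circ \tilde{W}_i(t))^T a + W_i(t)^T\gamma}\,d\Lambda(t)\Bigr\} \\
& \times (\sqrt{2\pi})^{-d_a} |\Sigma_a|^{-1/2} \exp\{-a^T \Sigma_a^{-1} a/2\}\,da
\end{aligned}
\tag{4}
\]

and $(\hat\theta, \hat\Lambda)$ maximizes $l_n(\theta, \Lambda)$ over the
space $\{(\theta, \Lambda): \theta \in \Theta, \Lambda \in \mathcal{Z}_n\}$.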

The proof of the consistency can be established by verifying the following
steps (i)-(iii). One particular remark is that all the following arguments
are made and hold for a fixed $\omega$ in the probability space, except for
some zero-probability set; thus, all the bounds or constants given below may
depend on this $\omega$.

(i) The maximum likelihood estimate $(\hat\theta, \hat\Lambda)$ exists.

(ii) We will show that, with probability one, $\hat\Lambda(\tau)$ is bounded
as $n$ goes to infinity.

(iii) If (ii) is true, by the Helly selection theorem (cf. page 336 of [4]),
we can choose a subsequence of $\hat\Lambda$ such that $\hat\Lambda$ weakly
converges to some right-continuous monotone function $\Lambda^*$ with
probability 1; that is, the measure given by
$\hat\mu([0, t]) = \hat\Lambda(t)$ for $t \in [0, \tau]$ weakly converges to
the measure given by $\mu^*([0, t]) = \Lambda^*(t)$. By choosing a
sub-subsequence, we can further assume $\hat\theta \to \theta^*$. Thus, our
third step is to show $\theta^* = \theta_0$ and $\Lambda^* = \Lambda_0$.

Once the above three steps are proved, we conclude that, with probability 1,
$\hat\theta$ converges to $\theta_0$ and $\hat\Lambda$ weakly converges to
$\Lambda_0$ in $[0, \tau]$. However, since $\Lambda_0$ is continuous in
$[0, \tau]$, the latter can be strengthened to uniform convergence; that is,
$\sup_{t \in [0,\tau]} |\hat\Lambda(t) - \Lambda_0(t)| \to 0$ almost surely.
Hence, Theorem 3.1 is proved.

The proofs of (i)-(iii) are given in the following. 

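Proof of (i). It suffices to show that the jump size of $\hat\Lambda$ at
$Z_i$ for which $\Delta_i = 1$ is finite. Since for each $i$ such that
$\Delta_i = 1$,

\[
\Lambda\{Z_i\} \exp\Bigl\{-\int_0^{Z_i} e^{(\phi \circ \tilde{W}_i(t))^T a + W_i(t)^T\gamma}\,d\Lambda(t)\Bigr\}
\le \exp\{-2((\phi \circ \tilde{W}_i(Z_i))^T a + W_i(Z_i)^T\gamma)\}\,\Lambda\{Z_i\}^{-1},
\]

$l_n(\theta, \Lambda)$ is less than

\[
\begin{aligned}
\sum_{i=1}^n \log \int_a
& (2\pi\sigma_y^2)^{-N_i/2}
  \exp\{-(Y_i - X_i^T\beta - \tilde{X}_i^T a)^T (Y_i - X_i^T\beta - \tilde{X}_i^T a)/2\sigma_y^2\} \\
& \times \exp\{-\Delta_i((\phi \circ \tilde{W}_i(Z_i))^T a + W_i(Z_i)^T\gamma)\} \\
& \times \Lambda\{Z_i\}^{-\Delta_i} (\sqrt{2\pi})^{-d_a} |\Sigma_a|^{-1/2} \exp\{-a^T \Sigma_a^{-1} a/2\}\,da.
\end{aligned}
\]

Thus, if $\Lambda\{Z_i\} \to \infty$ for some $i$ such that $\Delta_i = 1$,
then $l_n(\theta, \Lambda) \to -\infty$. We conclude that the jump size of
$\hat\Lambda$ must be finite. On the other hand, $\theta$ belongs to a
compact set $\Theta$. Then the maximum likelihood estimate
$(\hat\theta, \hat\Lambda)$ exists.
□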
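
The first inequality in the proof of (i) is stated without derivation. One way to obtain it, offered here as a sketch rather than as the authors' argument, uses the elementary bound $e^{-x} \le x^{-2}$ for $x > 0$ together with the fact that the integral against a step function is at least the contribution of its jump at $Z_i$. The shorthand $q_i(t)$ is introduced only for this illustration.

```latex
% Shorthand (introduced here): q_i(t) = (\phi \circ \tilde W_i(t))^T a + W_i(t)^T \gamma.
% Since e^{x}/x^{2} is minimised at x = 2 with value e^{2}/4 > 1,
% we have e^{-x} \le x^{-2} for every x > 0.
\[
\int_0^{Z_i} e^{q_i(t)}\,d\Lambda(t) \;\ge\; e^{q_i(Z_i)}\,\Lambda\{Z_i\},
\]
\[
\Lambda\{Z_i\}\exp\Bigl\{-\int_0^{Z_i} e^{q_i(t)}\,d\Lambda(t)\Bigr\}
\;\le\; \Lambda\{Z_i\}\exp\bigl\{-e^{q_i(Z_i)}\Lambda\{Z_i\}\bigr\}
\;\le\; \Lambda\{Z_i\}\bigl(e^{q_i(Z_i)}\Lambda\{Z_i\}\bigr)^{-2}
\;=\; e^{-2 q_i(Z_i)}\,\Lambda\{Z_i\}^{-1}.
\]
```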

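Proof of (ii). Define $\hat\xi = \log \hat\Lambda(\tau)$ and rescale
$\hat\Lambda$ by the factor $e^{\hat\xi}$. We denote the rescaled function as
$\tilde\Lambda$; thus $\tilde\Lambda(\tau) = 1$. To prove (ii), it is
sufficient to show $\hat\xi$ is bounded.

Clearly $\hat\xi$ maximizes the log-likelihood function
$l_n(\hat\theta, \tilde\Lambda e^{\xi})$. After some algebra in expression
(4), we obtain that, for any $\Lambda \in \mathcal{Z}_n$,
$n^{-1} l_n(\theta, \Lambda)$ is equal to

\[
\begin{aligned}
& -\frac{1}{n}\sum_{i=1}^n \frac{N_i}{2}\log(2\pi\sigma_y^2)
  - \log\{(\sqrt{2\pi})^{d_a} |\Sigma_a|^{1/2}\}
  - \frac{1}{2n}\sum_{i=1}^n \log|V_i| \\
& + \frac{1}{n}\sum_{i=1}^n \Delta_i W_i(Z_i)^T\gamma
  - \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^T (Y_i - X_i^T\beta)/2\sigma_y^2 \\
& + \frac{1}{2n}\sum_{i=1}^n \{(\phi \circ \tilde{W}_i(Z_i))\Delta_i + \tilde{X}_i (Y_i - X_i^T\beta)/\sigma_y^2\}^T
  V_i^{-1} \{(\phi \circ \tilde{W}_i(Z_i))\Delta_i + \tilde{X}_i (Y_i - X_i^T\beta)/\sigma_y^2\} \\
& + \frac{1}{n}\sum_{i=1}^n \Delta_i \log\Lambda\{Z_i\}
  + \frac{1}{n}\sum_{i=1}^n \log\int_a \exp\Bigl\{-\frac{a^T a}{2} - \int_0^{Z_i} e^{Q_{1i}(t, a, \theta)}\,d\Lambda(t)\Bigr\}\,da,
\end{aligned}
\]

where $V_i = \tilde{X}_i \tilde{X}_i^T/\sigma_y^2 + \Sigma_a^{-1}$ and

\[
Q_{1i}(t, a, \theta) = \{\phi \circ \tilde{W}_i(t)\}^T V_i^{-1/2} a + W_i(t)^T\gamma
+ \{\phi \circ \tilde{W}_i(t)\}^T V_i^{-1} \{(\phi \circ \tilde{W}_i(Z_i))\Delta_i + \tilde{X}_i (Y_i - X_i^T\beta)/\sigma_y^2\}.
\]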
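Thus, since
$0 \le n^{-1} l_n(\hat\theta, e^{\hat\xi}\tilde\Lambda) - n^{-1} l_n(\hat\theta, \tilde\Lambda)$,
it follows that

\[
\begin{aligned}
0 \le{} & \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 + \frac{1}{n}\sum_{i=1}^n \log\int_a \exp\Bigl\{-\frac{a^T a}{2} - e^{\hat\xi}\int_0^{Z_i} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da \\
& - \frac{1}{n}\sum_{i=1}^n \log\int_a \exp\Bigl\{-\frac{a^T a}{2} - \int_0^{Z_i} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da.
\end{aligned}
\tag{5}
\]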

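According to assumption (A.2), there exist some positive constants $C_1$,
$C_2$ and $C_3$ such that
$|Q_{1i}(t, a, \hat\theta)| \le C_1\|a\| + C_2\|Y_i\| + C_3$. If we denote by
$a_0$ a standard multivariate normal random vector, then, since
$\tilde\Lambda(\tau) = 1$ and by the concavity of the logarithm function,

\[
\begin{aligned}
\log\int_a \exp\Bigl\{-\frac{a^T a}{2} - \int_0^{Z_i} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da
&= \log\{(2\pi)^{d_a/2}\} + \log E_{a_0}\Bigl[\exp\Bigl\{-\int_0^{Z_i} e^{Q_{1i}(t, a_0, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\Bigr] \\
&\ge \log E_{a_0}\bigl[\exp\{-e^{C_1\|a_0\| + C_2\|Y_i\| + C_3}\}\bigr] \\
&\ge E_{a_0}\bigl[-e^{C_1\|a_0\| + C_2\|Y_i\| + C_3}\bigr] \\
&= -e^{C_2\|Y_i\| + C_4},
\end{aligned}
\]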

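where $C_4$ is another constant. Thus, by the strong law of large numbers and
assumption (A.4),

\[
-\frac{1}{n}\sum_{i=1}^n \log\Bigl\{\int_a \exp\Bigl\{-\frac{a^T a}{2} - \int_0^{Z_i} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da\Bigr\}
\le \frac{1}{n}\sum_{i=1}^n e^{C_2\|Y_i\| + C_4}
\]

can be bounded by some constant $C_5$ from above. Then (5) becomes

\[
\begin{aligned}
0 \le{} & \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 + \frac{1}{n}\sum_{i=1}^n \log\Bigl\{\int_a \exp\Bigl\{-\frac{a^T a}{2} - e^{\hat\xi}\int_0^{Z_i} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da\Bigr\} + C_5 \\
\le{} & \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 + \frac{1}{n}\sum_{i=1}^n I(Z_i = \tau)\log\Bigl\{\int_a \exp\Bigl\{-\frac{a^T a}{2} - e^{\hat\xi}\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da\Bigr\} \\
& + \frac{1}{n}\sum_{i=1}^n I(Z_i \ne \tau)\log\Bigl\{\int_a \exp\Bigl\{-\frac{a^T a}{2}\Bigr\}\,da\Bigr\} + C_5 \\
\le{} & \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 + \frac{1}{n}\sum_{i=1}^n I(Z_i = \tau)\log\Bigl\{\int_a \exp\Bigl\{-\frac{a^T a}{2} - e^{\hat\xi}\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}\,da\Bigr\} + C_6,
\end{aligned}
\tag{6}
\]

where $C_6$ is a constant.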

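On the other hand, since for any $\Gamma > 0$ and $x > 0$,
$\Gamma\log(1 + x/\Gamma) \le \Gamma \cdot x/\Gamma = x$, we have that
$e^{-x} \le (1 + x/\Gamma)^{-\Gamma}$. Therefore,

\[
\begin{aligned}
\exp\Bigl\{-\frac{a^T a}{2} - e^{\hat\xi}\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}
&\le \exp\Bigl\{-\frac{a^T a}{2}\Bigr\}\Bigl\{1 + e^{\hat\xi}\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)/\Gamma\Bigr\}^{-\Gamma} \\
&\le \Gamma^{\Gamma}\exp\Bigl\{-\frac{a^T a}{2}\Bigr\}\Bigl\{e^{\hat\xi}\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}^{-\Gamma} \\
&= \Gamma^{\Gamma}\exp\Bigl\{-\frac{a^T a}{2} - \Gamma\hat\xi\Bigr\}\Bigl\{\int_0^{\tau} e^{Q_{1i}(t, a, \hat\theta)}\,d\tilde\Lambda(t)\Bigr\}^{-\Gamma}.
\end{aligned}
\]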



Since $Q_{1i}(t, a, \hat\theta) \ge -C_1\|a\| - C_2\|Y_i\| - C_3$ and
$\tilde\Lambda(\tau) = 1$, (6) gives that

\[
\begin{aligned}
0 \le{} & C_6 + \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 + \frac{1}{n}\sum_{i=1}^n I(Z_i = \tau)\log\int_a e^{-\Gamma\hat\xi}\,\Gamma^{\Gamma}
   \exp\Bigl\{-\frac{a^T a}{2} + C_1\Gamma\|a\| + C_2\Gamma\|Y_i\| + C_3\Gamma\Bigr\}\,da \\
\le{} & C_6 + \frac{1}{n}\sum_{i=1}^n \Delta_i \hat\xi
 - \frac{\Gamma}{n}\sum_{i=1}^n I(Z_i = \tau)\,\hat\xi + C_7(\Gamma),
\end{aligned}
\]

where $C_7(\Gamma)$ is a deterministic function of $\Gamma$. By the strong
law of large numbers,
$n^{-1}\sum_{i=1}^n I(Z_i = \tau) \to P(Z = \tau) > 0$. Then we can choose
$\Gamma$ large enough such that
$n^{-1}\sum_{i=1}^n \Delta_i \le (2n)^{-1}\Gamma\sum_{i=1}^n I(Z_i = \tau)$.
Thus, we obtain that

\[
0 \le C_7(\Gamma) + C_6 - \frac{\Gamma}{2n}\sum_{i=1}^n I(Z_i = \tau)\,\hat\xi.
\]

In other words, if we let
$B_0 = \exp\{4(C_6 + C_7(\Gamma))/\Gamma P(Z = \tau)\}$, we conclude that
$\hat\Lambda(\tau) \le B_0$. Note that the above arguments hold for every
sample in the probability space except a set with zero probability. Thus, we
have shown that, with probability 1, $\hat\Lambda(\tau)$ is bounded for any
sample size $n$. □
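
The passage from the last displayed inequality to the constant $B_0$ can be spelled out as follows. This is a sketch under two assumptions made here for illustration: $\hat\xi \ge 0$ (otherwise $\hat\Lambda(\tau) < 1$ already) and $C_6 + C_7(\Gamma) \ge 0$, which can be arranged by enlarging the constants. The symbol $\hat p_n$ and the threshold $P(Z = \tau)/2$ are introduced only for this illustration.

```latex
% \hat p_n denotes the empirical fraction of subjects with Z_i = \tau.
\[
0 \le C_6 + C_7(\Gamma) - \frac{\Gamma}{2}\,\hat p_n\,\hat\xi,
\qquad \hat p_n = \frac{1}{n}\sum_{i=1}^n I(Z_i = \tau),
\]
\[
\hat\xi \;\le\; \frac{2\{C_6 + C_7(\Gamma)\}}{\Gamma\,\hat p_n}
\;\le\; \frac{4\{C_6 + C_7(\Gamma)\}}{\Gamma\,P(Z = \tau)} = \log B_0
\quad \text{once } \hat p_n \ge P(Z = \tau)/2,
\]
\[
\hat\Lambda(\tau) = e^{\hat\xi} \le B_0 .
\]
```

By the strong law of large numbers, the condition $\hat p_n \ge P(Z = \tau)/2$ holds for all large $n$ with probability one.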

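Proof of (iii). In this step we will show that if
$\hat\theta \to \theta^*$ and $\hat\Lambda$ weakly converges to $\Lambda^*$
with probability one, then $\theta^* = \theta_0$ and
$\Lambda^* = \Lambda_0$. For convenience, we use $O$ to abbreviate the
observed statistics $(Y, X, \tilde{X}, Z, \Delta, N)$ and
$\{W(s), \tilde{W}(s), 0 \le s \le Z\}$ and denote $G(a, O; \theta, \Lambda)$
as

\[
\begin{aligned}
& (2\pi\sigma_y^2)^{-N/2}\{(2\pi)^{d_a}|\Sigma_a|\}^{-1/2} \\
& \times \exp\Bigl\{-\frac{(Y - X^T\beta - \tilde{X}^T a)^T (Y - X^T\beta - \tilde{X}^T a)}{2\sigma_y^2}
  - \frac{a^T \Sigma_a^{-1} a}{2} \\
& \qquad\quad + \Delta((\phi \circ \tilde{W}(Z))^T a + W(Z)^T\gamma)
  - \int_0^{Z} e^{(\phi \circ \tilde{W}(t))^T a + W(t)^T\gamma}\,d\Lambda(t)\Bigr\}.
\end{aligned}
\]

Moreover, we define

\[
Q(z, O; \theta, \Lambda)
= \frac{\int_a G(a, O; \theta, \Lambda)\exp\{(\phi \circ \tilde{W}(z))^T a + W(z)^T\gamma\}\,da}
       {\int_a G(a, O; \theta, \Lambda)\,da}
\]

and for any measurable function $f(O)$, we use operator notation to define
$P_n f = n^{-1}\sum_{i=1}^n f(O_i)$ and $Pf = \int f\,dP = E[f(O)]$. Thus,
$P_n$ is the empirical measure from $n$ i.i.d. observations and
$\sqrt{n}(P_n - P)$ is the empirical process based on these observations
(cf. Section 2.1 of [24]). We also define a class
$\mathcal{F} = \{Q(z, O; \theta, \Lambda): z \in [0, \tau], \theta \in \Theta, \Lambda \in \mathcal{Z}, \Lambda(0) = 0, \Lambda(\tau) \le B_0\}$,
where $B_0$ is the constant given in step (ii) and $\mathcal{Z}$ contains all
nondecreasing functions in $[0, \tau]$. According to Appendix A.1,
$\mathcal{F}$ is P-Donsker.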

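We start to prove (iii). Since $(\hat\theta, \hat\Lambda)$ maximizes the
function $l_n(\theta, \Lambda)$, where $\Lambda$ is any step function with
jumps only at $Z_i$ for which $\Delta_i = 1$, after differentiating
$l_n(\hat\theta, \Lambda)$ with respect to $\Lambda\{Z_k\}$, we obtain that
$\hat\Lambda$ satisfies the equation

\[
\hat\Lambda\{Z_k\} = \Delta_k \big/ \bigl[n P_n\{I(Z \ge z)\,Q(z, O; \hat\theta, \hat\Lambda)\}\bigr]\big|_{z = Z_k}.
\]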
Thus, imitating the above equation, we can construct another function,
denoted by $\bar\Lambda$, such that $\bar\Lambda$ is also a step function
with jumps only at the observed $Z_k$ and the jump size is given by

\[
\bar\Lambda\{Z_k\} = \Delta_k \big/ \bigl[n P_n\{I(Z \ge z)\,Q(z, O; \theta_0, \Lambda_0)\}\bigr]\big|_{z = Z_k}.
\]

Equivalently,

\[
\bar\Lambda(t) = \frac{1}{n}\sum_{k=1}^n
\frac{I(Z_k \le t)\,\Delta_k}{P_n\{I(Z \ge z)\,Q(z, O; \theta_0, \Lambda_0)\}\big|_{z = Z_k}}.
\]
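
The jump-size equations have a Breslow-type form: each jump is one over a weighted count of the subjects still at risk. A minimal sketch follows, assuming no tied event times and a user-supplied callable `q(i, t)` that stands in for $Q(t, O_i; \theta, \Lambda)$; both the function name `npmle_jumps` and the callable are hypothetical. In the actual estimator the weights depend on $\hat\Lambda$ itself, so this computes only one pass for given weights.

```python
def npmle_jumps(z, delta, q):
    """Jump sizes Delta_k / sum_i I(Z_i >= Z_k) * q(i, Z_k) at uncensored times.

    z     : observed times Z_i (assumed distinct in this sketch)
    delta : event indicators Delta_i (1 = event, 0 = censored)
    q     : hypothetical callable, q(i, t) plays the role of Q(t, O_i; theta, Lambda)
    Returns a dict mapping each uncensored Z_k to its jump size.
    """
    jumps = {}
    for zk, dk in zip(z, delta):
        if dk == 1:
            weighted_at_risk = sum(q(i, zk) for i, zi in enumerate(z) if zi >= zk)
            jumps[zk] = dk / weighted_at_risk
    return jumps


def cumulative(jumps, t):
    """Value at time t of the step function defined by the jumps."""
    return sum(size for time, size in jumps.items() if time <= t)
```

With `q` identically one, the weights disappear and the jumps reduce to the familiar one-over-number-at-risk increments, which is a quick way to sanity-check an implementation.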



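We claim $\bar\Lambda(t)$ uniformly converges to $\Lambda_0(t)$ in
$[0, \tau]$. To prove the claim, we note that

\[
\begin{aligned}
& \sup_{t \in [0,\tau]} \Bigl|\bar\Lambda(t)
  - E\Bigl[\frac{I(Z \le t)\Delta}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|_{z=Z}}\Bigr]\Bigr| \\
& \le \sup_{t \in [0,\tau]} \Bigl|\frac{1}{n}\sum_{k=1}^n I(Z_k \le t)\Delta_k
  \Bigl[\frac{1}{P_n\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}
  - \frac{1}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}\Bigr]_{z=Z_k}\Bigr| \\
& \quad + \sup_{t \in [0,\tau]} \Bigl|(P_n - P)\Bigl[\frac{I(Z \le t)\Delta}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|_{z=Z}}\Bigr]\Bigr| \\
& \le \sup_{z \in [0,\tau]} \Bigl|\frac{1}{P_n\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}
  - \frac{1}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}\Bigr| \\
& \quad + \sup_{t \in [0,\tau]} \Bigl|(P_n - P)\Bigl[\frac{I(Z \le t)\Delta}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|_{z=Z}}\Bigr]\Bigr|.
\end{aligned}
\]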



According to Appendix A.1, $\{Q(z, O; \theta_0, \Lambda_0): z \in [0, \tau]\}$
is a bounded and Glivenko-Cantelli class. Since
$\{I(Z \ge z): z \in [0, \tau]\}$ is also a Glivenko-Cantelli class and the
functional $(f, g) \mapsto fg$ for any two bounded functions $f$ and $g$ is
Lipschitz continuous,
$\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0): z \in [0, \tau]\}$ is a
Glivenko-Cantelli class. Then we obtain that
$\sup_{z \in [0,\tau]} |P_n\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\} - P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|$
converges to 0. Moreover, from Appendix A.1,
$P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}$ is larger than
$P\{I(Z \ge \tau)\exp\{-C_8 - C_9\|Y\|\}\}$ for two constants $C_8$ and
$C_9$, so is bounded from below. Thus, the first term on the right-hand side
of the above inequality tends to zero. Additionally, since the class
$\{I(Z \le t)\Delta / P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|_{z=Z}: t \in [0, \tau]\}$
is also a Glivenko-Cantelli class, the second term on the right-hand side of
the above inequality vanishes as $n$ goes to infinity. Therefore, we conclude
that $\bar\Lambda(t)$ uniformly converges to

\[
E\Bigl[\frac{I(Z \le t)\Delta}{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|_{z=Z}}\Bigr].
\]

It is easy to verify that this limit is equal to $\Lambda_0(t)$. Thus, our
claim that $\bar\Lambda$ uniformly converges to $\Lambda_0$ in $[0, \tau]$
holds.
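From the construction of $\bar\Lambda$, we obtain that

\[
\hat\Lambda(t) = \int_0^t
\frac{P_n\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}{P_n\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}}\,d\bar\Lambda(z).
\tag{7}
\]

$\hat\Lambda(t)$ is absolutely continuous with respect to $\bar\Lambda(t)$.
On the other hand, since $\{I(Z \ge z): z \in [0, \tau]\}$ and $\mathcal{F}$
are both Glivenko-Cantelli classes,
$\{I(Z \ge z)Q(z, O; \theta, \Lambda): z \in [0, \tau], \theta \in \Theta, \Lambda \in \mathcal{Z}, \Lambda(\tau) \le B_0\}$
is also a Glivenko-Cantelli class. Thus,

\[
\sup_{z \in [0,\tau]} |(P_n - P)\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}|
+ \sup_{z \in [0,\tau]} |(P_n - P)\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}|
\to 0 \quad \text{a.s.}
\]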



On the other hand, using the bounded convergence theorem and the fact that
$\hat\theta$ converges to $\theta^*$ and $\hat\Lambda$ weakly converges to
$\Lambda^*$, $P\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}$ converges to
$P\{I(Z \ge z)Q(z, O; \theta^*, \Lambda^*)\}$ for each $z$; moreover, it is
straightforward to check that the derivative of
$P\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}$ with respect to $z$ is
uniformly bounded, so $P\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}$ is
equicontinuous with respect to $z$. Thus, by the Arzela-Ascoli theorem
(page 245 of [21]), uniformly in $z \in [0, \tau]$,

\[
P\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\} \to P\{I(Z \ge z)Q(z, O; \theta^*, \Lambda^*)\}.
\]

Then it holds that, uniformly in $z \in [0, \tau]$,

\[
\frac{\hat\Lambda\{z\}}{\bar\Lambda\{z\}}
= \frac{P_n\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}{P_n\{I(Z \ge z)Q(z, O; \hat\theta, \hat\Lambda)\}}
\to \frac{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}{P\{I(Z \ge z)Q(z, O; \theta^*, \Lambda^*)\}}.
\tag{8}
\]

After taking limits on both sides of (7), we obtain that

\[
\Lambda^*(t) = \int_0^t
\frac{P\{I(Z \ge z)Q(z, O; \theta_0, \Lambda_0)\}}{P\{I(Z \ge z)Q(z, O; \theta^*, \Lambda^*)\}}\,d\Lambda_0(z).
\]

Therefore, since $\Lambda_0(t)$ is differentiable with respect to the
Lebesgue measure, so is $\Lambda^*(t)$ and we denote $\lambda^*(t)$ as the
derivative of $\Lambda^*(t)$. Additionally, from (8) we note that
$\hat\Lambda\{Z\}/\bar\Lambda\{Z\}$ uniformly converges to
$d\Lambda^*(Z)/d\Lambda_0(Z) = \lambda^*(Z)/\lambda_0(Z)$. A second
conclusion is that $\hat\Lambda$ uniformly converges to $\Lambda^*$ since
$\Lambda^*$ is continuous.
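On the other hand,

$$n^{-1}l_n(\hat\theta,\hat\Lambda) - n^{-1}l_n(\theta_0,\tilde\Lambda) = P_n\left[\Delta\log\frac{\hat\Lambda\{Z\}}{\tilde\Lambda\{Z\}}\right] + P_n\left[\log\frac{\int_a G(a,O;\hat\theta,\hat\Lambda)\,da}{\int_a G(a,O;\theta_0,\tilde\Lambda)\,da}\right] \ge 0.$$

Using the result of Appendix A.1 and similar arguments as above, we can verify
that $\log[\int_a G(a,O;\hat\theta,\hat\Lambda)\,da/\int_a G(a,O;\theta_0,\tilde\Lambda)\,da]$ belongs to a Glivenko-Cantelli
class and

$$P_n\left[\log\frac{\int_a G(a,O;\hat\theta,\hat\Lambda)\,da}{\int_a G(a,O;\theta_0,\tilde\Lambda)\,da}\right] \to P\left[\log\frac{\int_a G(a,O;\theta^*,\Lambda^*)\,da}{\int_a G(a,O;\theta_0,\Lambda_0)\,da}\right].$$

Since $\hat\Lambda\{Z\}/\tilde\Lambda\{Z\}$ uniformly converges to $\lambda^*(Z)/\lambda_0(Z)$, we obtain that

$$P\left[\log\frac{\lambda^*(Z)^\Delta\int_a G(a,O;\theta^*,\Lambda^*)\,da}{\lambda_0(Z)^\Delta\int_a G(a,O;\theta_0,\Lambda_0)\,da}\right] \ge 0.$$

However, the left-hand side of the inequality is the negative Kullback-Leibler
information. Then it immediately follows that, with probability one,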



$$\lambda^*(Z)^\Delta\int_a G(a,O;\theta^*,\Lambda^*)\,da = \lambda_0(Z)^\Delta\int_a G(a,O;\theta_0,\Lambda_0)\,da. \tag{9}$$
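The step from the inequality to (9) rests on the fact that the negative Kullback-Leibler information is never positive and vanishes only when the two densities agree. A minimal numerical sketch of that fact for discrete distributions follows; the three-point distributions are illustrative stand-ins for the true and limiting models, not quantities from the paper.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler information KL(p || q) for two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


true_model = [0.5, 0.3, 0.2]   # plays the role of the model at (theta_0, Lambda_0)
same_model = [0.5, 0.3, 0.2]   # a candidate limit that coincides with the truth
other_model = [0.4, 0.4, 0.2]  # a candidate limit that differs from the truth

neg_kl_same = -kl_divergence(true_model, same_model)
neg_kl_other = -kl_divergence(true_model, other_model)
```

Here `neg_kl_same` is zero and `neg_kl_other` is strictly negative, so a nonnegative value of the negative Kullback-Leibler information is only possible when the candidate coincides with the truth.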



Our proof is completed if we can show $\theta^*=\theta_0$ and $\Lambda^*=\Lambda_0$ from (9).
Since (9) holds with probability one, (9) holds for any $(Z,\Delta=1)$ and $(Z=\tau,\Delta=0)$,
but may not hold for $(Z,\Delta=0)$ when $Z\in(0,\tau)$. However, we
can show that (9) is also true for $(Z,\Delta=0)$ when $Z\in(0,\tau)$. To see that,
treating both sides of (9) at $\Delta=1$ as functions of $Z$, we integrate these functions over
the interval $(Z,\tau)$ to obtain

$$\int_a G(a,O;\theta^*,\Lambda^*)\,da\Big|_{\Delta=0,Z=\tau} - \int_a G(a,O;\theta^*,\Lambda^*)\,da\Big|_{\Delta=0,Z=Z} = \int_a G(a,O;\theta_0,\Lambda_0)\,da\Big|_{\Delta=0,Z=\tau} - \int_a G(a,O;\theta_0,\Lambda_0)\,da\Big|_{\Delta=0,Z=Z}.$$

After comparing this equality with another equality, which is given by (9)
at $\Delta=0$ and $Z=\tau$, we obtain

$$\int_a G(a,O;\theta^*,\Lambda^*)\,da\Big|_{\Delta=0} = \int_a G(a,O;\theta_0,\Lambda_0)\,da\Big|_{\Delta=0};$$

that is, (9) also holds for any $Z$ and $\Delta=0$.

Thus, we let $\Delta=0$ and $Z=0$ in (9). After integrating over $a$, we have
that with probability one,

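$$\begin{aligned}
&\frac{1}{\sigma_y^{*N}|\Sigma_a^*|^{1/2}}\left|\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right|^{-1/2}
\exp\left[-\frac{(Y-X^T\beta^*)^T(Y-X^T\beta^*)}{2\sigma_y^{*2}} + \frac{1}{2\sigma_y^{*4}}(Y-X^T\beta^*)^T\tilde X\left(\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right)^{-1}\tilde X^T(Y-X^T\beta^*)\right]\\
&\quad=\frac{1}{\sigma_{0y}^{N}|\Sigma_{0a}|^{1/2}}\left|\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right|^{-1/2}
\exp\left[-\frac{(Y-X^T\beta_0)^T(Y-X^T\beta_0)}{2\sigma_{0y}^{2}} + \frac{1}{2\sigma_{0y}^{4}}(Y-X^T\beta_0)^T\tilde X\left(\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right)^{-1}\tilde X^T(Y-X^T\beta_0)\right].
\end{aligned}\tag{10}$$
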

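By comparing the coefficients of $YY^T$, $Y$ and the constant term in the
exponential parts, we obtain

$$\frac{1}{\sigma_y^{*2}}I-\frac{1}{\sigma_y^{*4}}\tilde X\left(\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right)^{-1}\tilde X^T = \frac{1}{\sigma_{0y}^{2}}I-\frac{1}{\sigma_{0y}^{4}}\tilde X\left(\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right)^{-1}\tilde X^T, \tag{11}$$

$$\left\{\frac{1}{\sigma_y^{*2}}I-\frac{1}{\sigma_y^{*4}}\tilde X\left(\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right)^{-1}\tilde X^T\right\}X^T\beta^* = \left\{\frac{1}{\sigma_{0y}^{2}}I-\frac{1}{\sigma_{0y}^{4}}\tilde X\left(\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right)^{-1}\tilde X^T\right\}X^T\beta_0 \tag{12}$$

and

$$\frac{1}{\sigma_y^{*N}|\Sigma_a^*|^{1/2}}\left|\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right|^{-1/2} = \frac{1}{\sigma_{0y}^{N}|\Sigma_{0a}|^{1/2}}\left|\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right|^{-1/2}. \tag{13}$$
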

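By (A.9), (12) gives $\beta^*=\beta_0$. We multiply both sides of (11) by $\tilde X^T$ from
the left and by $\tilde X$ from the right. According to assumption (A.9), it holds
that, with positive probability,

$$\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}-\frac{1}{\sigma_y^{*4}}\tilde X^T\tilde X\left(\Sigma_a^{*-1}+\frac{\tilde X^T\tilde X}{\sigma_y^{*2}}\right)^{-1}\tilde X^T\tilde X = \frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}-\frac{1}{\sigma_{0y}^{4}}\tilde X^T\tilde X\left(\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^{2}}\right)^{-1}\tilde X^T\tilde X.$$

Thus,

$$\sigma_y^{*2}I+\Sigma_a^*\tilde X^T\tilde X = \sigma_{0y}^2I+\Sigma_{0a}\tilde X^T\tilde X.$$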
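The passage from (11), multiplied by $\tilde X^T$ and $\tilde X$, to the last display uses a standard matrix identity. A short derivation is sketched here with the shorthand $M=\tilde X^T\tilde X$, $\sigma^2$ and $\Sigma$ standing for either the starred or the true parameters; the shorthand is introduced only for this illustration.

```latex
\begin{aligned}
\frac{M}{\sigma^2}-\frac{1}{\sigma^4}M\Big(\Sigma^{-1}+\frac{M}{\sigma^2}\Big)^{-1}M
 &= \frac{1}{\sigma^2}M\Big\{I-(\sigma^2\Sigma^{-1}+M)^{-1}M\Big\} \\
 &= \frac{1}{\sigma^2}M(\sigma^2\Sigma^{-1}+M)^{-1}\sigma^2\Sigma^{-1} \\
 &= M\{\Sigma(\sigma^2\Sigma^{-1}+M)\}^{-1}
  = M(\sigma^2 I+\Sigma M)^{-1}.
\end{aligned}
```

When $M$ is invertible, equating this expression at the starred and at the true parameters and cancelling $M$ yields $\sigma_y^{*2}I+\Sigma_a^*M=\sigma_{0y}^2I+\Sigma_{0a}M$.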

Combining this result with (13) and by assumption (A.7), where $P(N > d_a \mid H_X(\tau), X(t)) > 0$, we have that $\sigma_y^* = \sigma_{0y}$. This further gives $\Sigma_a^* = \Sigma_{0a}$.

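Next, to show that $\phi^*=\phi_0$, $\gamma^*=\gamma_0$ and $\Lambda^*=\Lambda_0$, we let $\Delta=0$ in (9) and
notice that (9) can be written as

$$E_a\left[\exp\left\{-\int_0^Z e^{(\phi^*\circ\tilde W(t))^Ta+W(t)^T\gamma^*}\,d\Lambda^*(t)\right\}\right] = E_a\left[\exp\left\{-\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\,d\Lambda_0(t)\right\}\right],$$

where $a$ follows a normal distribution with mean $\{\tilde X^T\tilde X/\sigma_{0y}^2+\Sigma_{0a}^{-1}\}^{-1}\{(Y-X^T\beta_0)^T\tilde X/\sigma_{0y}^2\}^T$
and covariance $[\tilde X^T\tilde X/\sigma_{0y}^2+\Sigma_{0a}^{-1}]^{-1}$. However, for any
fixed $X$ and $\tilde X$, treating $\tilde X^TY$ as parameters in this normal family, $a$ is
the complete statistic for $\tilde X^TY$. Therefore,

$$\exp\left\{-\int_0^Z e^{(\phi^*\circ\tilde W(t))^Ta+W(t)^T\gamma^*}\,d\Lambda^*(t)\right\} = \exp\left\{-\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\,d\Lambda_0(t)\right\}.$$

Equivalently,

$$e^{(\phi^*\circ\tilde W(t))^Ta+W(t)^T\gamma^*}\lambda^*(t) = e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\lambda_0(t).$$

According to assumptions (A.9) and (A.11), $\phi^*=\phi_0$, $\gamma^*=\gamma_0$ and $\Lambda^*=\Lambda_0$.
□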

5. Proof of Theorem 3.2. The asymptotic properties for the estimators
$(\hat\theta,\hat\Lambda)$ follow if we can verify the conditions in [24], Theorem 3.3.1. For
completeness, we state this theorem below (the version from Appendix A
of [19]).

Theorem 5.1 (Theorem 3.3.1 in [24]). Let $S_n$ and $S$ be random maps
and a fixed map, respectively, from $\Psi$ to a Banach space such that:

(a) $\sqrt n(S_n-S)(\hat\psi_n)-\sqrt n(S_n-S)(\psi_0)=o_P^*(1+\sqrt n\|\hat\psi_n-\psi_0\|)$.

(b) The sequence $\sqrt n(S_n-S)(\psi_0)$ converges in distribution to a tight
random element $Z$.

(c) The function $\psi\mapsto S(\psi)$ is Fréchet differentiable at $\psi_0$ with a continuously
invertible derivative $\nabla S_{\psi_0}$ (on its range).

(d) $S(\psi_0)=0$ and $\hat\psi_n$ satisfies $S_n(\hat\psi_n)=o_P^*(n^{-1/2})$ and converges in outer
probability to $\psi_0$.

Then $\sqrt n(\hat\psi_n-\psi_0)\Rightarrow-\nabla S_{\psi_0}^{-1}Z$.
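To make the conclusion of Theorem 5.1 concrete, here is a minimal simulation sketch in a one-dimensional parametric toy problem, which is not the joint model of this paper. For an exponential rate $\psi$, the score map is $S_n(\psi)=P_n(1/\psi-x)$, its derivative at $\psi_0$ is $-1/\psi_0^2$, and the limit $Z$ has variance $1/\psi_0^2$, so the theorem suggests a limiting variance of $\psi_0^2$ for $\sqrt n(\hat\psi_n-\psi_0)$. The sample size, number of replications and seed are arbitrary choices.

```python
import math
import random
import statistics


def exponential_mle(sample):
    """Zero of S_n(psi) = P_n(1/psi - x), i.e. the MLE of an exponential rate."""
    return len(sample) / sum(sample)


def scaled_errors(rate, n, replications, seed=0):
    """Draws of sqrt(n) * (psi_hat - psi_0) over repeated samples of size n."""
    rng = random.Random(seed)
    draws = []
    for _ in range(replications):
        sample = [rng.expovariate(rate) for _ in range(n)]
        draws.append(math.sqrt(n) * (exponential_mle(sample) - rate))
    return draws


rate = 2.0
draws = scaled_errors(rate, n=400, replications=2000)
empirical_variance = statistics.variance(draws)

# Derivative of S at psi_0 is -1/rate**2 and Var(Z) = 1/rate**2, so the
# limiting variance of -(derivative)^{-1} Z is rate**4 * (1/rate**2) = rate**2.
limiting_variance = rate ** 2
```

The empirical variance of the scaled errors should be close to `limiting_variance`, up to Monte Carlo error and a finite-sample bias of order $1/n$.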

In our situation, the parameter $\psi=(\theta,\Lambda)\in\Psi=\{(\theta,\Lambda):\|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|\le\delta\}$
for a fixed small constant $\delta$ (note $\Psi$ is a convex
set). Define a set

$$\mathcal H=\{(h_1,h_2):\|h_1\|\le1,\ \|h_2\|_V\le1\},$$

where $\|h_2\|_V$ is the total variation of $h_2$ in $[0,\tau]$ defined as

$$\sup_{0=t_0<t_1<t_2<\cdots<t_m=\tau}\ \sum_{j=1}^m|h_2(t_j)-h_2(t_{j-1})|.$$
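The total variation norm above can be approximated on a fine partition. The sketch below evaluates the partition sum on an equally spaced grid, which gives a lower bound for the supremum that is tight for piecewise monotone functions once the grid contains the turning points; the grid size and test functions are illustrative choices.

```python
import math


def total_variation(h, grid):
    """Sum of |h(t_j) - h(t_{j-1})| over one partition 0 = t_0 < ... < t_m = tau."""
    values = [h(t) for t in grid]
    return sum(abs(b - a) for a, b in zip(values, values[1:]))


def uniform_grid(tau, m):
    """Equally spaced partition of [0, tau] with m subintervals."""
    return [tau * j / m for j in range(m + 1)]


tau = 2.0 * math.pi
grid = uniform_grid(tau, 1000)

tv_sine = total_variation(math.sin, grid)               # up 1, down 2, up 1
tv_monotone = total_variation(lambda t: t / tau, grid)  # rises from 0 to 1
in_unit_ball = tv_monotone <= 1.0 + 1e-9                # candidate h_2 for the set H
```

For a monotone function the partition sum telescopes to $|h(\tau)-h(0)|$, so any monotone $h_2$ with range of length at most one satisfies $\|h_2\|_V\le1$.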
Moreover, we let

$$S_n(\psi)(h_1,h_2)=P_n\{l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]\},$$

$$S(\psi)(h_1,h_2)=P\{l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]\},$$

where $l_\theta(\theta,\Lambda)$ is the first derivative of the log-likelihood function from one
single subject, denoted by $l(O;\theta,\Lambda)$, with respect to $\theta$, and $l_\Lambda(\theta,\Lambda)[h_2]$ is
the derivative of $l(O;\theta,\Lambda_\varepsilon)$ at $\varepsilon=0$, where $\Lambda_\varepsilon(t)=\int_0^t(1+\varepsilon h_2(s))\,d\Lambda(s)$.
Thus, it is easy to see that $S_n$ and $S$ are both maps from $\Psi$ to $l^\infty(\mathcal H)$ and
$\sqrt n\{S_n(\psi)-S(\psi)\}$ is an empirical process in the space $l^\infty(\mathcal H)$.
According to Appendix A.2, the class

$$\mathcal G=\Bigl\{l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]-l_\theta(\theta_0,\Lambda_0)^Th_1-l_\Lambda(\theta_0,\Lambda_0)[h_2]:\ \|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|\le\delta,\ (h_1,h_2)\in\mathcal H\Bigr\}$$

is P-Donsker (cf. Section 2.1 of [24]). Moreover, Appendix A.2 also implies
that

$$\sup_{(h_1,h_2)\in\mathcal H}P\bigl[l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]-l_\theta(\theta_0,\Lambda_0)^Th_1-l_\Lambda(\theta_0,\Lambda_0)[h_2]\bigr]^2\to0$$

when $\|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|\to0$. Then (a) follows from Lemma 3.3.5
of [24]. By the Donsker theorem (Section 2.5 of [24]), (b) holds as
a result of Appendix A.2 and the convergence is defined in the metric space
$l^\infty(\mathcal H)$. (d) is true since $(\hat\theta,\hat\Lambda)$ maximizes $P_nl(O;\theta,\Lambda)$, $(\theta_0,\Lambda_0)$ maximizes
$Pl(O;\theta,\Lambda)$ and $(\hat\theta,\hat\Lambda)$ converges to $(\theta_0,\Lambda_0)$ from Theorem 3.1.

It remains to verify the conditions in (c). The proof of the first half in (c)
is tedious so we defer it to Appendix A.3. We only need to prove that $\nabla S_{\psi_0}$
is continuously invertible on its range in $l^\infty(\mathcal H)$. From Appendix A.3, $\nabla S_{\psi_0}$
can be written as follows: for any $(\theta_1,\Lambda_1)$ and $(\theta_2,\Lambda_2)$ in $\Psi$,

$$\nabla S_{\psi_0}(\theta_1-\theta_2,\Lambda_1-\Lambda_2)[h_1,h_2]=(\theta_1-\theta_2)^T\Omega_1[h_1,h_2]+\int_0^\tau\Omega_2[h_1,h_2]\,d(\Lambda_1-\Lambda_2)(t), \tag{14}$$

where both $\Omega_1$ and $\Omega_2$ are linear operators on $\mathcal H$ and $\Omega=(\Omega_1,\Omega_2)$ maps
$\mathcal H\subset R^d\times BV[0,\tau]$ to $R^d\times BV[0,\tau]$, where $BV[0,\tau]$ contains all the functions
with finite total variation in $[0,\tau]$. The explicit expressions of $\Omega_1$ and $\Omega_2$ are
given in Appendix A.3. From (14), we can treat $(\theta_1-\theta_2,\Lambda_1-\Lambda_2)$ as an
element in $l^\infty(\mathcal H)$ via the following definition:

$$(\theta_1-\theta_2,\Lambda_1-\Lambda_2)[h_1,h_2]=(\theta_1-\theta_2)^Th_1+\int_0^\tau h_2(t)\,d(\Lambda_1-\Lambda_2)(t)\qquad\forall(h_1,h_2)\in R^d\times BV[0,\tau].$$
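Then $\nabla S_{\psi_0}$ can be extended as a linear operator from $l^\infty(\mathcal H)$ to itself.
Therefore, if we can show that there exists some positive constant $\varepsilon$ such
that $\varepsilon\mathcal H\subset\Omega(\mathcal H)$, then for any $(\delta\theta,\delta\Lambda)\in l^\infty(\mathcal H)$,

$$\|\nabla S_{\psi_0}(\delta\theta,\delta\Lambda)\|_{l^\infty(\mathcal H)}=\sup_{(h_1,h_2)\in\mathcal H}\left|\delta\theta^T\Omega_1[h_1,h_2]+\int_0^\tau\Omega_2[h_1,h_2]\,d\delta\Lambda(t)\right|=\|(\delta\theta,\delta\Lambda)\|_{l^\infty(\Omega(\mathcal H))}\ge\varepsilon\|(\delta\theta,\delta\Lambda)\|_{l^\infty(\mathcal H)}.$$

Hence, $\nabla S_{\psi_0}$ is continuously invertible.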

To prove $\varepsilon\mathcal H\subset\Omega(\mathcal H)$ for some $\varepsilon$ is equivalent to showing that $\Omega$ is invertible.
We note from Appendix A.3 that $\Omega$ is the summation of an invertible
operator and a compact operator. According to Theorem 4.25 of [20], to
prove the invertibility of $\Omega$, it is sufficient to verify that $\Omega$ is one to one: if
$\Omega[h_1,h_2]=0$, then by choosing $\theta_1-\theta_2=\varepsilon h_1$ and $\Lambda_1-\Lambda_2=\varepsilon\int h_2\,d\Lambda_0$ in
(14) for a small constant $\varepsilon$, we obtain $\nabla S_{\psi_0}(h_1,\int h_2\,d\Lambda_0)[h_1,h_2]=0$. By the
definition of $\nabla S_{\psi_0}$, we notice that the left-hand side is the negative information
matrix in the submodel $(\theta_0+\varepsilon h_1,\Lambda_0+\varepsilon\int h_2\,d\Lambda_0)$. Therefore, the score
function along this submodel should be zero with probability one. That is,
$l_\theta(\theta_0,\Lambda_0)^Th_1+l_\Lambda(\theta_0,\Lambda_0)[h_2]=0$; that is, if we let $(h_1^y,h_1^a,h_1^\beta,h_1^\phi,h_1^\gamma)$ be the
corresponding components of $h_1$ for the parameters $(\sigma_y,\mathrm{Vec}(\Sigma_a),\beta,\phi,\gamma)$,
respectively, and let $D_a$ be the symmetric matrix such that $\mathrm{Vec}(D_a)=h_1^a$,
then with probability one,



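$$\begin{aligned}
0={}&\int_a G(a,O;\theta_0,\Lambda_0)\Biggl[\frac{a^T\Sigma_{0a}^{-1}D_a\Sigma_{0a}^{-1}a}{2}-\frac{\mathrm{Tr}(\Sigma_{0a}^{-1}D_a)}{2}-\frac{Nh_1^y}{\sigma_{0y}}\\
&\qquad+\frac{(Y-X^T\beta_0-\tilde X^Ta)^T(Y-X^T\beta_0-\tilde X^Ta)h_1^y}{\sigma_{0y}^3}+\frac{\{X(Y-X^T\beta_0-\tilde X^Ta)\}^Th_1^\beta}{\sigma_{0y}^2}\\
&\qquad+\Delta\{(\tilde W(Z)\circ h_1^\phi)^Ta+W(Z)^Th_1^\gamma\}\\
&\qquad-\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\{(\tilde W(t)\circ h_1^\phi)^Ta+W(t)^Th_1^\gamma\}\,d\Lambda_0(t)\Biggr]da\\
&+\Delta h_2(Z)\int_a G(a,O;\theta_0,\Lambda_0)\,da-\int_a G(a,O;\theta_0,\Lambda_0)\int_0^Z h_2(t)e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\,d\Lambda_0(t)\,da.
\end{aligned}\tag{15}$$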
Note that (15) holds with probability one, so it may not hold for any $Z\in[0,\tau]$
when $\Delta=0$. However, if we integrate both sides from $Z$ to $\tau$ and
subtract the obtained equation from (15) at $\Delta=0$ and $Z=\tau$, it is easy to
show that (15) also holds for any $Z\in[0,\tau]$ when $\Delta=0$. Particularly, we let
$\Delta=0$ and $Z=0$ in (15), and define

$$V_a=\{\Sigma_{0a}^{-1}+\tilde X^T\tilde X/\sigma_{0y}^2\}^{-1}\quad\text{and}\quad m_a=V_a\tilde X^T(Y-X^T\beta_0)/\sigma_{0y}^2.$$

We obtain

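$$\begin{aligned}
&\frac12\mathrm{Tr}(\Sigma_{0a}^{-1}D_a\Sigma_{0a}^{-1}V_a)+\frac12m_a^T\Sigma_{0a}^{-1}D_a\Sigma_{0a}^{-1}m_a-\frac12\mathrm{Tr}(\Sigma_{0a}^{-1}D_a)-\frac{Nh_1^y}{\sigma_{0y}}\\
&\quad+\frac{h_1^y}{\sigma_{0y}^3}\Bigl\{(Y-X^T\beta_0)^T(Y-X^T\beta_0)-\frac{2}{\sigma_{0y}^2}(Y-X^T\beta_0)^T\tilde XV_a\tilde X^T(Y-X^T\beta_0)+\mathrm{Tr}(\tilde X^T\tilde XV_a)+m_a^T\tilde X^T\tilde Xm_a\Bigr\}\\
&\quad+\{(Y-X^T\beta_0)^TX^T-(\tilde Xm_a)^TX^T\}\frac{h_1^\beta}{\sigma_{0y}^2}=0.
\end{aligned}$$

Examining the coefficient for $(Y-X^T\beta_0)$ gives that $h_1^\beta=0$. The terms
without $(Y-X^T\beta_0)$ give

$$-\frac{1}{2\sigma_{0y}^2}\mathrm{Tr}(\tilde X^T\tilde XV_a\Sigma_{0a}^{-1}D_a)-\frac{Nh_1^y}{\sigma_{0y}}+\frac{h_1^y}{\sigma_{0y}^3}\mathrm{Tr}(\tilde X^T\tilde XV_a)=0. \tag{16}$$
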

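Moreover, the coefficients for the quadratic term $(Y-X^T\beta_0)(Y-X^T\beta_0)^T$
are equal to

$$\frac{1}{2\sigma_{0y}^4}\tilde XV_a\Sigma_{0a}^{-1}D_a\Sigma_{0a}^{-1}V_a\tilde X^T+\frac{h_1^y}{\sigma_{0y}^3}\Bigl\{I-\frac{2}{\sigma_{0y}^2}\tilde XV_a\tilde X^T+\frac{1}{\sigma_{0y}^4}\tilde XV_a\tilde X^T\tilde XV_a\tilde X^T\Bigr\}=0. \tag{17}$$

Multiplying both sides of (17) by $\tilde X^T$ from the left and by $\tilde X$ from the right
gives

$$\frac{1}{2\sigma_{0y}^4}\tilde X^T\tilde XV_a\Sigma_{0a}^{-1}D_a\Sigma_{0a}^{-1}V_a\tilde X^T\tilde X+\frac{h_1^y}{\sigma_{0y}^3}\Bigl\{\tilde X^T\tilde X-\frac{2}{\sigma_{0y}^2}\tilde X^T\tilde XV_a\tilde X^T\tilde X+\frac{1}{\sigma_{0y}^4}\tilde X^T\tilde XV_a\tilde X^T\tilde XV_a\tilde X^T\tilde X\Bigr\}=0.$$
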

Since $\tilde X^T\tilde XV_a/\sigma_{0y}^2=I-\Sigma_{0a}^{-1}V_a$, we obtain



1 



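Furthermore, if we multiply the above equation by $V_a^{-1}\Sigma_{0a}$ from the right
and take the trace of the matrix, we have

$$\frac{1}{2\sigma_{0y}^4}\mathrm{Tr}(\tilde X^T\tilde XV_a\Sigma_{0a}^{-1}D_a)+\frac{h_1^y}{\sigma_{0y}^3}\mathrm{Tr}\Bigl\{V_a^{-1}\Sigma_{0a}-\frac{1}{\sigma_{0y}^2}\tilde X^T\tilde X\Sigma_{0a}-\frac{1}{\sigma_{0y}^2}\tilde X^T\tilde XV_a\Bigr\}=0.$$

After substituting equation (16) into the above equation, we obtain

$$\Bigl\{N+\mathrm{Tr}\Bigl(\frac{1}{\sigma_{0y}^2}\tilde X^T\tilde X\Sigma_{0a}\Bigr)-\mathrm{Tr}(V_a^{-1}\Sigma_{0a})\Bigr\}h_1^y=0.$$

Thus, $h_1^y=0$ based on assumption (A.7) and, moreover, from (17), $D_a=0$.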
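The role of assumption (A.7) in this step becomes visible once the trace is simplified. A short worked computation, using only the definition of $V_a$ and writing $d_a$ for the dimension of the random effect $a$ (the symbol that appears in (A.7)), is sketched here.

```latex
\begin{aligned}
\mathrm{Tr}(V_a^{-1}\Sigma_{0a})
 &= \mathrm{Tr}\Bigl\{\Bigl(\Sigma_{0a}^{-1}+\frac{\tilde X^T\tilde X}{\sigma_{0y}^2}\Bigr)\Sigma_{0a}\Bigr\}
  = d_a+\mathrm{Tr}\Bigl(\frac{1}{\sigma_{0y}^2}\tilde X^T\tilde X\Sigma_{0a}\Bigr),\\
N+\mathrm{Tr}\Bigl(\frac{1}{\sigma_{0y}^2}\tilde X^T\tilde X\Sigma_{0a}\Bigr)-\mathrm{Tr}(V_a^{-1}\Sigma_{0a})
 &= N-d_a .
\end{aligned}
```

The equation therefore reads $(N-d_a)h_1^y=0$, and $h_1^y=0$ follows because $N>d_a$ occurs with positive probability.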
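Next, we let $\Delta=0$ in (15) and obtain

$$E_a\left[\exp\left\{-\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\,d\Lambda_0(t)\right\}\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\{(\tilde W(t)\circ h_1^\phi)^Ta+W(t)^Th_1^\gamma+h_2(t)\}\,d\Lambda_0(t)\right]=0,$$

where $a$ follows a normal distribution with mean

$$\{\tilde X^T\tilde X/\sigma_{0y}^2+\Sigma_{0a}^{-1}\}^{-1}\{(Y-X^T\beta_0)^T\tilde X/\sigma_{0y}^2\}^T$$

and covariance $\{\tilde X^T\tilde X/\sigma_{0y}^2+\Sigma_{0a}^{-1}\}^{-1}$. However, for any fixed $X$ and $\tilde X$,
treating $\tilde X^TY$ as parameters in this normal family, $a$ is a complete statistic
for $\tilde X^TY$. Therefore,

$$\int_0^Z e^{(\phi_0\circ\tilde W(t))^Ta+W(t)^T\gamma_0}\{(\tilde W(t)\circ h_1^\phi)^Ta+W(t)^Th_1^\gamma+h_2(t)\}\,d\Lambda_0(t)=0.$$

From assumption (A.9), this immediately gives $h_1^\phi=0$, $h_1^\gamma=0$ and $h_2(t)=0$.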

Since conditions (a)-(d) have been proved, Theorem 3.3.1 of [24] concludes
that $\sqrt n(\hat\theta-\theta_0,\hat\Lambda-\Lambda_0)$ weakly converges to a tight random element in
$l^\infty(\mathcal H)$. Moreover, we obtain

$$-\sqrt n\,\nabla S_{\psi_0}(\hat\theta-\theta_0,\hat\Lambda-\Lambda_0)[h_1,h_2]=\sqrt n(P_n-P)\{l_\theta(\theta_0,\Lambda_0)^Th_1+l_\Lambda(\theta_0,\Lambda_0)[h_2]\}+o_P(1), \tag{18}$$

where $o_P(1)$ is a random variable which converges to zero in probability
in $l^\infty(\mathcal H)$. From (14), if we denote $(\tilde h_1,\tilde h_2)=\Omega^{-1}(h_1,h_2)$, then (18) can also
be written as

$$-\sqrt n\left\{(\hat\theta-\theta_0)^Th_1+\int_0^\tau h_2(t)\,d(\hat\Lambda-\Lambda_0)(t)\right\}=\sqrt n(P_n-P)\{l_\theta(\theta_0,\Lambda_0)^T\tilde h_1+l_\Lambda(\theta_0,\Lambda_0)[\tilde h_2]\}+o_P(1). \tag{19}$$

In other words, $\sqrt n(\hat\theta-\theta_0,\hat\Lambda-\Lambda_0)$ weakly converges to a Gaussian process
in $l^\infty(\mathcal H)$. Particularly, if we choose $h_2=0$ in (19), then $\hat\theta^Th_1$ is an asymptotically
linear estimator for $\theta_0^Th_1$ with influence function $-\{l_\theta(\theta_0,\Lambda_0)^T\tilde h_1+l_\Lambda(\theta_0,\Lambda_0)[\tilde h_2]\}$.
Since this influence function is in the linear space spanned by
the score functions for $\theta_0$ and $\Lambda_0$, Proposition 3.3.1 in [3] concludes that the
influence function is the same as the efficient influence function for $\theta_0^Th_1$;
that is, $\hat\theta$ is an efficient estimator for $\theta_0$ and Theorem 3.2 has been proved.

6. Proof of Theorem 3.3. According to the profile likelihood theory for
the semiparametric model (Theorem 1 in [18]), we need to construct an
approximately least favorable submodel and verify all the conditions in that
theorem. To construct the least favorable submodel, from (19) there exists
a vector of functions, denoted by $\mathbf h_2$, such that $l_\theta(\theta_0,\Lambda_0)+l_\Lambda(\theta_0,\Lambda_0)[\mathbf h_2]$
is the efficient score function for $\theta_0$. Thus, the least favorable submodel at
$(\theta,\Lambda)$ is given by $\varepsilon\mapsto(\varepsilon,\Lambda_\varepsilon(\theta,\Lambda))$, where $\Lambda_\varepsilon(\theta,\Lambda)=\Lambda+(\varepsilon-\theta)^T\int\mathbf h_2\,d\Lambda$. For
this submodel, conditions (8) and (9) in [18] hold.

We note that, in the consistency proof of Theorem 3.1, when $\tilde\theta$ is not necessarily
the maximum likelihood estimate but $\tilde\theta\to_P\theta_0$, the same arguments
as in the proofs of (ii) and (iii) give that $\hat\Lambda_{\tilde\theta}$, which maximizes $l_n(\tilde\theta,\Lambda)$ over
$\mathcal Z_n$, is bounded and its limit should be equal to $\Lambda_0$. Thus, condition (10)
in [18] holds. Condition (11) in [18] can be checked straightforwardly. Furthermore,
using similar arguments as in Appendix A.2, we can directly check
the Donsker property of the class

$$\Bigl\{\nabla_\varepsilon l(\varepsilon,\Lambda_\varepsilon(\theta,\Lambda)):\ \|\varepsilon-\theta_0\|+\|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|<\delta\Bigr\}$$

and the Glivenko-Cantelli property of the class

$$\Bigl\{\nabla^2_{\varepsilon\varepsilon}l(\varepsilon,\Lambda_\varepsilon(\theta,\Lambda)):\ \|\varepsilon-\theta_0\|+\|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|<\delta\Bigr\}$$

for a small constant $\delta$.

Hence, Theorem 3.3 follows from Theorem 1 and Corollaries 2 and 3
in [18].

7. Discussion. We have derived the asymptotic properties of the maxi- 
mum likelihood estimators for joint models of repeated measurements and 
survival time. Our results provide a theoretical justification for the maxi- 
mum likelihood estimation in such types of analysis. 

Our proofs can be generalized to obtain the asymptotic properties
of the maximum likelihood estimators in other joint models, for example,
the joint models of repeated measurements and multivariate survival times
which were discussed by Li and Lin [17] and Huang, Zeger, Anthony and
Garrett [14], the joint models of repeated measurements and recurrent event
times [11], and so on. Moreover, based on our proof, we can see that Theorems
3.1-3.3 hold even if the random effect, $a$, has slightly heavier tails
than the normal density, for example, if the tail approximates $\exp\{-\|a\|^\alpha\}$,
where $1<\alpha<2$. We also note that recent work by Tsiatis and Davidian [22]
used the conditional score equation approach to obtain consistent estimates 
for the regression coefficients in the proportional hazards model (1) without 
any assumptions on random effects. However, their estimates are not effi- 
cient and the applicability of the maximum likelihood estimation under this 
situation is yet unknown. 

APPENDIX 

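A.1. Donsker property of $\mathcal F$. Recall that $\mathcal F=\{Q(z,O;\theta,\Lambda):z\in[0,\tau],\ \theta\in\Theta,\ \Lambda\in\mathcal A\}$,
where $\mathcal A=\{\Lambda\in\mathcal Z,\ \Lambda(\tau)\le B_0\}$. We can rewrite $Q(z,O;\theta,\Lambda)$ as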

n( n a a\ r> ( n /n^O*, O; 0, A) 
Q{z,O;0,A) = Q 1 (z,O;G) 

Q 3 (z,O;0,A) 

where 
Qi(z,0;B) 

= exp{W(z) T 7 + A((0 o W(z)) + 2X T (Y - X^))^ -1 ^ o W(z))}, 
Q 2 (z,O;0,A) 

= / exp[-^- /' Z exp{(0oW(t)) T V- 1 a + W(t) r 7 + U(*)}dA(t) 
Ja L 2 Jo 



Qs(O;0,A) 



eX p|_^_jT e (^ w (*)) Tv - la + w W T ^dA(t)|da. 
Here, $V=\tilde X^T\tilde X/\sigma_y^2+\Sigma_a^{-1}$ and $U(t)=(\phi\circ\tilde W(t))^TV^{-1}((\phi\circ\tilde W(z))+\tilde X^T(Y-X^T\beta)/\sigma_y^2)$.

Using assumption (A.2), we can easily show that $Q_1(z,O;\theta)$ is continuously
differentiable with respect to $z$ and $\theta$ and

$$\|\nabla_\theta Q_1(z,O;\theta)\|+\left|\frac{\partial}{\partial z}Q_1(z,O;\theta)\right|\le e^{g_1+g_2\|Y\|}$$

for some positive constants $g_1$ and $g_2$. Furthermore, it holds that



\v e Q 2 (z,o-e,A)\\ + 



< 



cxp 



T 

a a 



^Q 2 (z,0;G,A) 

az 



e 33l|a||+ 5 4||Y||+ ff5 D da 



< e 96+97l|Y|| 

and $\|\nabla_\theta Q_3(O;\theta,\Lambda)\|\le e^{g_8+g_9\|Y\|}$ for some positive constants $g_3,\ldots,g_9$. Additionally,

|Q 2 (z,O;0,A 1 )-Q 2 (z,O;0,A 2 )| 



< 



cxp 



T 

a a 







e (4>ow( t) y v-^+w(ty tt+uw d(Ai _ Aa)(t) da 



< (2^/ 2 



,(0oW(t)) T V- 2 (</)oW(t))/2+W(t) T T+U(t) 



d(Ai-A 2 )(t) 



<(2n) d «/ 2 |A!(t)-A 2 (t)| 

JO 

d 



eft 1 



(0oW(t)) T V- 2 (0oW(t))/2+W(t) T 7+U(t)- 



+ (2vr) da/2 |Ai(Z) - A 2 (Z)|e ( ^ oW(z))Tv ~ 2(0oW(z))/2+w(z)T ^ +u(z) 
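$$\le e^{g_{10}+g_{11}\|Y\|}\left\{|\Lambda_1(Z)-\Lambda_2(Z)|+\int_0^\tau|\Lambda_1(t)-\Lambda_2(t)|\,dt\right\},$$

where $g_{10}$ and $g_{11}$ are two positive constants. Similarly,

$$|Q_3(z,O;\theta,\Lambda_1)-Q_3(z,O;\theta,\Lambda_2)|\le e^{g_{10}+g_{11}\|Y\|}\left\{|\Lambda_1(Z)-\Lambda_2(Z)|+\int_0^\tau|\Lambda_1(t)-\Lambda_2(t)|\,dt\right\}.$$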



On the other hand, there exist positive constants $g_{12},\ldots,g_{16}$ such that
$|Q_1(z,O;\theta)|\le e^{g_{12}+g_{13}\|Y\|}$, $|Q_2(z,O;\theta,\Lambda)|\le(2\pi)^{d_a/2}$ and $Q_3(z,O;\theta,\Lambda)\ge\int_a\exp\{-a^Ta/2-e^{g_{14}+g_{15}\|a\|}B_0\}\,da\ge g_{16}>0$.
Therefore, by the mean-value theorem,
we conclude that, for any $(z_1,\theta_1,\Lambda_1)$ and $(z_2,\theta_2,\Lambda_2)$ in $[0,\tau]\times\Theta\times\mathcal A$,

$$|Q(z_1,O;\theta_1,\Lambda_1)-Q(z_2,O;\theta_2,\Lambda_2)|\le e^{g_{17}+g_{18}\|Y\|}\left\{\|\theta_1-\theta_2\|+|\Lambda_1(Z)-\Lambda_2(Z)|+\int_0^\tau|\Lambda_1(t)-\Lambda_2(t)|\,dt+|z_1-z_2|\right\} \tag{A.1}$$

holds for some positive constants $g_{17}$ and $g_{18}$.

According to Theorem 2.7.5 in [24], the entropy number for the class
$\mathcal A$ satisfies $\log N_{[\cdot]}(\varepsilon,\mathcal A,L_2(P))\le K/\varepsilon$, where $K$ is a constant. Thus, we
can find $\exp\{K/\varepsilon\}$ brackets, $\{[L_j,U_j]\}$, to cover the class $\mathcal A$ such that, for
each pair of $[L_j,U_j]$, $\|U_j-L_j\|_{L_2(P)}\le\varepsilon$. We can further find a partition of
$[0,\tau]\times\Theta$, say $S_1\cup S_2\cup\cdots$, such that the number of partitions is of the
order $(1/\varepsilon)^{d+1}$ and for any $(z_1,\theta_1)$ and $(z_2,\theta_2)$ in the same partition, their
Euclidean distance is less than $\varepsilon$. Therefore, the partition $\{S_1,S_2,\ldots\}\times\{[L_j,U_j]\}$
bracket covers $[0,\tau]\times\Theta\times\mathcal A$ and the total number of the partition
is of order $(1/\varepsilon)^{d+1}\exp\{K/\varepsilon\}$. Thus, from (A.1), for any $S_k$ and $[L_j,U_j]$, the
set of the functions $\{Q(z,O;\theta,\Lambda):(z,\theta)\in S_k,\ \Lambda\in\mathcal A,\ \Lambda\in[L_j,U_j]\}$ can be
bracket covered by

$$\Bigl[\,Q(z_k,O;\theta_k,\Lambda_j)-e^{g_{17}+g_{18}\|Y\|}\Bigl\{\varepsilon+|U_j(Z)-L_j(Z)|+\int_0^\tau|U_j(t)-L_j(t)|\,dt\Bigr\},$$
$$\ \ Q(z_k,O;\theta_k,\Lambda_j)+e^{g_{17}+g_{18}\|Y\|}\Bigl\{\varepsilon+|U_j(Z)-L_j(Z)|+\int_0^\tau|U_j(t)-L_j(t)|\,dt\Bigr\}\Bigr],$$

where $(z_k,\theta_k)$ is a fixed point in $S_k$ and $\Lambda_j$ is a fixed function in $[L_j,U_j]$.
Note that the $L_2(P)$ distance between these two functions is less than $O(\varepsilon)$.
Therefore, we have

$$N_{[\cdot]}(\varepsilon,\mathcal F,L_2(P))\le O(1)\Bigl(\frac1\varepsilon\Bigr)^{d+1}e^{K/\varepsilon}.$$

Furthermore, $\mathcal F$ has an $L_2(P)$-integrable covering function, which is equal
to $O(e^{g_{17}+g_{18}\|Y\|})$. From Theorem 2.5.6 in [24], $\mathcal F$ is P-Donsker.
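The Donsker conclusion relies on the bracketing entropy integral $\int_0^1\sqrt{\log N_{[\cdot]}(\varepsilon,\mathcal F,L_2(P))}\,d\varepsilon$ being finite, which holds here because the dominant term $\sqrt{K/\varepsilon}$ is integrable near zero. The following minimal sketch checks this numerically with a midpoint rule; the constants `d`, `K` and `c` are illustrative assumptions, not values from the paper.

```python
import math


def log_bracketing_number(eps, d=3, K=1.0, c=0.0):
    """Upper bound c + (d + 1) * log(1/eps) + K/eps on the log bracketing number."""
    return c + (d + 1) * math.log(1.0 / eps) + K / eps


def bracketing_integral(log_n, steps):
    """Midpoint-rule approximation of the integral of sqrt(log_n(eps)) over (0, 1)."""
    h = 1.0 / steps
    return sum(math.sqrt(log_n((k + 0.5) * h)) for k in range(steps)) * h


# Dominant term alone: the integral of sqrt(1/eps) over (0, 1) equals 2.
dominant = bracketing_integral(lambda e: 1.0 / e, steps=20000)

# Full bound with the illustrative constants; refining the grid barely changes it.
coarse = bracketing_integral(log_bracketing_number, steps=2000)
fine = bracketing_integral(log_bracketing_number, steps=20000)
```

The approximations stabilize as the grid is refined, in line with the integrability of $\varepsilon^{-1/2}$ at the origin; a bound of order $e^{K/\varepsilon^2}$ would instead produce a divergent integral.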

In the above derivation, we also note that all the functions in $\mathcal F$ are
bounded from below by $e^{-g_{19}-g_{20}\|Y\|}$ for some positive constants $g_{19}$ and
$g_{20}$.
A.2. Donsker property of $\mathcal G$. For any $(h_1,h_2)\in\mathcal H$, if we let $(h_1^y,h_1^a,h_1^\beta,h_1^\phi,h_1^\gamma)$
be the corresponding components of $h_1$ for the parameters $(\sigma_y,\Sigma_a,\beta,\phi,\gamma)$,
respectively, then $l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]$ has the expression

$$\mu_1(O;\theta,\Lambda)^Th_1-\int_0^Z\mu_2(t,O;\theta,\Lambda)^Th_1\,d\Lambda(t)+\Delta h_2(Z)-\int_0^Z\mu_3(t,O;\theta,\Lambda)h_2(t)\,d\Lambda(t),$$


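where

$$\begin{aligned}
\mu_1(O;\theta,\Lambda)^Th_1={}&\left[\int_a G(a,O;\theta,\Lambda)\,da\right]^{-1}\int_a G(a,O;\theta,\Lambda)\Biggl[\frac{a^T\Sigma_a^{-1}D_a\Sigma_a^{-1}a}{2}-\frac{\mathrm{Tr}(\Sigma_a^{-1}D_a)}{2}-\frac{Nh_1^y}{\sigma_y}\\
&+\frac{(Y-X^T\beta-\tilde X^Ta)^T(Y-X^T\beta-\tilde X^Ta)h_1^y}{\sigma_y^3}+\frac{\{X(Y-X^T\beta-\tilde X^Ta)\}^Th_1^\beta}{\sigma_y^2}\\
&+\Delta\{(\tilde W(Z)\circ h_1^\phi)^Ta+W(Z)^Th_1^\gamma\}\Biggr]da,
\end{aligned}$$

$$\mu_2(t,O;\theta,\Lambda)^Th_1=\left[\int_a G(a,O;\theta,\Lambda)\,da\right]^{-1}\int_a G(a,O;\theta,\Lambda)e^{(\phi\circ\tilde W(t))^Ta+W(t)^T\gamma}\{(\tilde W(t)\circ h_1^\phi)^Ta+W(t)^Th_1^\gamma\}\,da$$

and

$$\mu_3(t,O;\theta,\Lambda)=\left[\int_a G(a,O;\theta,\Lambda)\,da\right]^{-1}\int_a G(a,O;\theta,\Lambda)e^{(\phi\circ\tilde W(t))^Ta+W(t)^T\gamma}\,da.$$

Here, $D_a$ is a symmetric matrix such that $\mathrm{Vec}(D_a)=h_1^a$.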

For $j=1,2,3$, we denote $\nabla_\theta\mu_j$ and $\nabla_\Lambda\mu_j[\delta\Lambda]$ as the derivatives of $\mu_j$ with
respect to $\theta$ and $\Lambda$ along the path $\Lambda+\varepsilon\delta\Lambda$. Then, using calculations similar to
Appendix A.1, one can verify (the details are tedious) that $\nabla_\Lambda\mu_j[\delta\Lambda]=\int\mu_{j+3}(s,O;\theta,\Lambda)\,d\delta\Lambda(s)$
and that there exist two positive constants $r_1$ and $r_2$ such that

$$\sum_j\{|\mu_j|+\|\nabla_\theta\mu_j\|\}\le e^{r_1+r_2\|Y\|}.$$

On the other hand, by the mean value theorem, we have that, for any
$(\theta,\Lambda,h_1,h_2)$ and $(\tilde\theta,\tilde\Lambda,\tilde h_1,\tilde h_2)$ in $\Psi\times\mathcal H$,

le(e,A) T h 1 + l A (9,A)[h 2 ] - leiejfhi - l k {G,A)\jfi 2 ] 

= (0- 0) T VeVi(O; 9*, A*)h a + f M4 (t, O; 0* , A*) T hx d(A - A)(i) 

JO 

- f (0-0) T VgiM l (t,OiB*,A*)h 1 dA.(t) 
Jo 

z t 

-[ [ A t 5 (s,0;©*,A*)d(A-A)(s)hidA(t) 
Jo jo 

- / /i 2 (t,O;0*,A*) T hid(A-A)(i) 
Jo 

(A.2) - [ Z (6 -6) T V e ^(t,0;e*,A*)h 2 (t)dA(t) 
Jo 

- / Z f ^(s,O;0*,A*)d(A-A)(s)h 2 (t)dA(t) 
Jo Jo 







Z / x 3 (t,0 ; r,A*)/ l2 (t)d(A-A)(t) 



+ // 1 (O;0,A) T (h 1 -h 1 )- / /z 2 (i,O;0,A) T (hi-hi)dA(t) 

Jo 

+ A(/i 2 (Z) - fc 2 (Z)) - / /i 3 (t, O; e,A)(h 2 (t) - h 2 (t)) dA(t), 

Jo 

where $(\theta^*,\Lambda^*)$ is equal to $\varepsilon^*(\theta,\Lambda)+(1-\varepsilon^*)(\tilde\theta,\tilde\Lambda)$ for some $\varepsilon^*\in[0,1]$. Thus,
\l e (0, A) T hi + l A (0, A) [Z^] - Z e (0, A) T hi -l A (0, A) M | 
< e n+ra||Y||||| fl _ 0|| + || hi _ hid + |A(Z) - A(Z)| 

+ f T \A(t) - A(t)\[dt + d\h 2 (t)\ + d\h 2 (t)\] 



+ \h 2 (Z) - h 2 (Z)\ + f \h 2 (t) - /»2(t)|[dAi(t) + dA 2 (t)] 
Jo 



where $d|h_2(t)|=dh_2^+(t)+dh_2^-(t)$ and $d|\tilde h_2(t)|=d\tilde h_2^+(t)+d\tilde h_2^-(t)$. Therefore,
by using the same arguments as in Appendix A.1 and noting that
$\log N_{[\cdot]}(\varepsilon,\{h_2:\|h_2\|_V\le B_1\},L_2(Q))\le K/\varepsilon$ for a constant $B_1$ and any probability
measure $Q$ where $K$ is a constant (Theorem 2.7.5 of [24]), we obtain

$$\log N_{[\cdot]}(\varepsilon,\mathcal G,L_2(P))\le O\Bigl(\frac1\varepsilon+\log\frac1\varepsilon\Bigr).$$

Hence, $\mathcal G$ is P-Donsker.

Furthermore, from (A. 2) we can calculate that 

|Ze(0,A) T hi + l A (0,A)[h 2 ] - lo(0 , A ) T hi - Z A (0 O , A )[/i 2 ]| 

< e n+r a ||Y||||| fl _ 0O || + | A(Z) _ Aq(Z) | + ^ | A(t) _ Ao(t) | d A 



+ 







At3 (t,O;0*,A> 2 (t)d(A-A o )(t) 



If ||0 — 0q|| ~~ * and sup tg [ 0jT ] |A(t) — Ao(t)| — ► 0, the above expression con- 
verges to zero uniformly. Thus, 

sup P[l e {0, A) T hx + l A (0, A) [h 2 ] - l (O o , A<,) T hi - Z A (0 O , A )[h 2 }} 2 -> 0. 
(h!,h 2 )ew 

A.3. Derivative operator $\nabla S_{\psi_0}$. From (A.2) we can obtain that
$l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]-l_\theta(\theta_0,\Lambda_0)^Th_1-l_\Lambda(\theta_0,\Lambda_0)[h_2]$

= (0-0 o ) T |v^ 1 (O ; r,A*)-^ Z V^ 2 (t,O;r,A*)dAo(t)|h 1 

+ jT /(t < Z)L 4 (t, O; 0* , A*) - fi 2 (t, O; 0*, A*) 

- /x 5 (t, O; 0*, A*) dA (s) J d(A - Ao)(t) 

- (0 - o f T /(* < Z)VeHz{t, O; 0\A*)h 2 {t) dA (t) 

Jo 

- jT{l(t < Z)/x 6 (t, O; 0* , A*) ^ ^(a) dA (s) 



+ /(* < Z)/i 3 (t, O; 0*,A*)/i 2 (t)|d(A - A )(t). 

Then it is clear that 

vs^(e-e ,A-Ao)\h lt h2] 

= (0 - o ) T p{ V e //i(0; O , A ) - ^ V 0/ u 2 (t, O; O , A ) dA (t)}h : 
+ hf^ T p I(t<Z)^ fM (t,O;9 ,A ) 
-/i 2 (i,O;0o,A o ) 



-/x 5 (t,0;6>o,Ao)^ dA (s) 

(0 - O ) T [ T P{I(t < Z) V e /i 3 (t, O; O , A )}h 2 {t) dA (t) 
Jo 

J* p|/(i < Z)/i 6 (t, O; 6> , A ) jf /12(a) dA (s) 



d(A-A )(t) 



+ I(t < Z)fi 3 (t, O; O , A )/t 2 (t) J- d(A - A )(i). 

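It is tedious to check that, for $j=1,2,\ldots,6$,

$$\sup_{t\in[0,\tau]}\|\mu_j(t,O;\theta^*,\Lambda^*)-\mu_j(t,O;\theta_0,\Lambda_0)\|\le e^{r_3+r_4\|Y\|}\Bigl\{\|\theta^*-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda^*(t)-\Lambda_0(t)|\Bigr\}.$$

Thus,

$$\begin{aligned}
&P\bigl[l_\theta(\theta,\Lambda)^Th_1+l_\Lambda(\theta,\Lambda)[h_2]-l_\theta(\theta_0,\Lambda_0)^Th_1-l_\Lambda(\theta_0,\Lambda_0)[h_2]\bigr]\\
&\quad=\nabla S_{\psi_0}(\theta-\theta_0,\Lambda-\Lambda_0)[h_1,h_2]+o\Bigl(\|\theta-\theta_0\|+\sup_{t\in[0,\tau]}|\Lambda(t)-\Lambda_0(t)|\Bigr)(\|h_1\|+\|h_2\|_V).
\end{aligned}$$

Therefore, $S(\psi)$ is Fréchet differentiable at $\psi_0$.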

We can rewrite $\nabla S_{\psi_0}(\theta-\theta_0,\Lambda-\Lambda_0)[h_1,h_2]$ as $(\theta-\theta_0)^T\Omega_1[h_1,h_2]+\int_0^\tau\Omega_2[h_1,h_2]\,d(\Lambda-\Lambda_0)$, where

Oi [hi , /i 2 ] = hf p| V e /ii(0; O , A ) - jf* V /x 2 (t, O; O , A ) dA (t) 



P{J(i < Z)V e Mt, O; O , A )}/t 2 (i) dA (t) 



and 



n 2 [hi,/i 2 ] = hfP 



/(t < Z) L 4 (t, O; O , Ao) - fx 2 (t, O; O , A ; 

- fx 5 (t,O;9 ,A ) j dA Q (s 
p|/(t < Z)p 6 (i, O; O , A ) jf /12(a) dA (s 
$-P\{I(t\le Z)\mu_3(t,O;\theta_0,\Lambda_0)\}h_2(t).$

Then the operator $\Omega=(\Omega_1,\Omega_2)$ is the bounded linear operator from $R^d\times BV[0,\tau]$
to itself. Moreover, we note that $\Omega=A+(K_1,K_2)$, where
$A(h_1,h_2)=(h_1,-P\{I(t\le Z)\mu_3(t,O;\theta_0,\Lambda_0)\}h_2(t))$, $K_1(h_1,h_2)=\Omega_1[h_1,h_2]-h_1$ and


K 2 (h 1 ,/i 2 )=h J 1 P 



I(t < Z){m(t, O; Go, A ) - fMi(t, O; O , A ) 

r z 

- H S (t,O;e ,Ac) J dA (s) 
p| I(t < Z)fi 6 (t, O; O , A ) ^ /i 2 (s) dA (s)j. 



Obviously, $A$ is invertible. $K_1$ maps into a finite-dimensional space, so it is
compact. The image of $K_2$ is a continuously differentiable function in $[0,\tau]$.
According to the Arzelà-Ascoli theorem (page 245 in [21]), $K_2$ is a compact
operator from $R^d\times BV[0,\tau]$ to $BV[0,\tau]$. Therefore, we conclude that $\Omega$ is
the summation of an invertible operator and a compact operator.